This article contains the most important aspects of the method
$tabulate()
. This method aggregates sc_data
objects. The first part will use the
cancer dataset
from the
OGD article.
After that, other features of $tabulate()
will be
demonstrated with the data from the
structure of earnings survey (SES).
Notice that these tabulation methods can also be used with the
STATcube REST API. This means that objects created by
sc_table()
also have a $tabulate()
method.
Tabulating Data
Calling the $tabulate()
method with no arguments
produces a table with the same dimensions as $data
.
cancer$tabulate()
# A STATcubeR tibble: 46,479 x 5
`Tumore ICD/10 3-Steller` Reportin…¹ Province of r…² Sex Numbe…³
* <fct> <date> <fct> <fct> <int>
1 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Burgenland " male 2
2 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Carinthia" male 8
3 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Carinthia" female 2
4 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Lower Austria" male 6
5 <C00> Bösartige Neubildung der Lippe 1983-01-01 "Lower Austria" female 2
# … with 46,474 more rows, and abbreviated variable names ¹`Reporting year`,
# ²`Province of residence`, ³`Number of records F-KRE`
#> [1] TRUE
Instead of cancer$tabulate(...)
it is also possible to
use sc_tabulate(cancer, ...)
. All available parameters for
the $tabulate()
method are documented in
?sc_tabulate
.
Aggregation
Aggregating with sums
To get the number of cases by reporting year and sex, use the labels of those variables as arguments.
cancer$tabulate("Reporting year", "Sex")
# A STATcubeR tibble: 74 x 3
`Reporting year` Sex `Number of records F-KRE`
* <date> <fct> <int>
1 1983-01-01 male 14492
2 1983-01-01 female 17476
3 1984-01-01 male 14794
4 1984-01-01 female 17449
5 1985-01-01 male 14552
# … with 69 more rows
If more than one measure is included in the dataset, all measures
will be aggregated.
STATcubeR
uses rowsum()
to ensure a good performance with big
datasets. It is also possible to use partial matching or use codes.
cancer$tabulate("Reporting", "C-KRE")
# A STATcubeR tibble: 74 x 3
`Reporting year` Sex `Number of records F-KRE`
* <date> <fct> <int>
1 1983-01-01 male 14492
2 1983-01-01 female 17476
3 1984-01-01 male 14794
4 1984-01-01 female 17449
5 1985-01-01 male 14552
# … with 69 more rows
STATcubeR
will use pmatch()
to match the supplied strings with the
metadata to identify the variables that should be used for
aggregation.
Limitations of sums
In some cases, datasets cannot be aggregated using the
rowsum()
approach. As an example, take the structure of
earnings survey.
earnings <- od_table("OGD_veste309_Veste309_1")
earnings
#> Structure of Earnings Survey (SES) 2018 Gross hourly earnings in EUR
#> by citizenship, region (NUTS 2) and form of employment
#>
#> Dataset: OGD_veste309_Veste309_1 (data.statistik.gv.at)
#> Measures: Arithmetic mean, 1st quartile, 2nd quartile (median), 3rd
#> quartile, Number of employees
#> Fields: Sex <3>, Citizenship <9>, Region (NUTS2) <10>, Form of employment
#> <7>
#>
#> Request: [2022-12-20 11:38:27]
#> STATcubeR: 0.5.0.1
As we can see from the print()
output, the measures
contain means and quartiles. Therefore, aggregating the data via
rowsum()
is not meaningful. However, this dataset contains
a “total code” for every field.
earnings$tabulate()
# A STATcubeR tibble: 72 x 9
Sex Citizen…¹ `Region (NUTS2)` `Form of employment` Arith…²
* <fct> <fct> <fct> <fct> <dbl>
1 Sum total Total Total "Total" 17.6
2 Sum total Total Total "Standard employment " 19.0
3 Sum total Total Total "Non-standard employment (tot… 15.2
4 Sum total Total Total "Non-standard employment: par… 15.9
5 Sum total Total Total "Non-standard employment: fix… 16.8
6 Sum total Total Total "Non-standard employment: mar… 12
7 Sum total Total Total "Non-standard employment: tem… 14.0
8 Sum total Total AT11 Burgenland "Total" 15.7
9 Sum total Total AT12 Lower Austria "Total" 16.4
10 Sum total Total AT13 Vienna "Total" 19.2
# … with 62 more rows, 4 more variables: `1st quartile` <dbl>,
# `2nd quartile (median)` <dbl>, `3rd quartile` <dbl>,
# `Number of employees` <dbl>, and abbreviated variable names ¹Citizenship,
# ²`Arithmetic mean`
Aggregating via total codes
These total codes can be used to aggregate the data with
$tabulate()
. In order to do that, the total codes need to
be specified using $total_codes()
.
earnings$total_codes(Sex = "Sum total", Citizenship = "Total",
Region = "Total", `Form of employment` = "Total")
Now $tabulate()
will use these total codes to form
aggregates of the data.
earnings$tabulate("Form of employment")
# A STATcubeR tibble: 6 x 6
`Form of employment` Arith…¹ 1st q…² 2nd q…³ 3rd q…⁴ Numbe…⁵
* <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "Standard employment " 19.0 12.8 16.2 21.7 1685788
2 "Non-standard employment (total)" 15.2 10.2 13.0 17.2 965150
3 "Non-standard employment: part-time o… 15.9 10.7 13.8 18.3 598012
4 "Non-standard employment: fixed-term … 16.8 10.2 13.4 18.6 125491
5 "Non-standard employment: marginal pa… 12 9 10.1 12.3 141213
6 "Non-standard employment: temporary a… 14.0 11.5 13.5 15.7 100435
# … with abbreviated variable names ¹`Arithmetic mean`, ²`1st quartile`,
# ³`2nd quartile (median)`, ⁴`3rd quartile`, ⁵`Number of employees`
As we can see, the method extracted rows 2 to 7 from the data. The logic for selecting those rows is equivalent to the following dplyr expression.
earnings$data %>% dplyr::filter(Sex == "Sum total" & Citizenship == "Total" &
`Region (NUTS2)` == "Total" & `Form of employment` != "Total") %>%
dplyr::select(-Sex, -Citizenship, -`Region (NUTS2)`)
The $tabulate()
method also works with more than one
variable.
earnings$tabulate("Sex", "Form of employment")
# A STATcubeR tibble: 12 x 7
Sex `Form of employment` Arith…¹ 1st q…² 2nd q…³ 3rd q…⁴ Numbe…⁵
* <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Male "Standard employment " 20.0 13.5 17.0 22.8 1225839
2 Male "Non-standard employment (tot… 16.5 10.5 13.6 18.0 318749
3 Male "Non-standard employment: par… 18.8 11.1 15.2 22.5 119316
4 Male "Non-standard employment: fix… 18.3 10.9 14.2 19.6 65377
5 Male "Non-standard employment: mar… 12.6 9 10.2 12.6 58175
6 Male "Non-standard employment: tem… 14.4 12 13.8 15.9 75880
7 Female "Standard employment " 16.3 11.1 14.3 18.8 459949
8 Female "Non-standard employment (tot… 14.6 10.0 12.8 16.8 646401
9 Female "Non-standard employment: par… 15.2 10.6 13.5 17.6 478696
10 Female "Non-standard employment: fix… 15.2 9.68 12.6 16.4 60114
11 Female "Non-standard employment: mar… 11.6 9.02 10 12.1 83037
12 Female "Non-standard employment: tem… 13.0 10.4 12.2 14.8 24554
# … with abbreviated variable names ¹`Arithmetic mean`, ²`1st quartile`,
# ³`2nd quartile (median)`, ⁴`3rd quartile`, ⁵`Number of employees`
earnings$tabulate("Sex", "Citizenship")
# A STATcubeR tibble: 16 x 7
Sex Citizenship Arith…¹ 1st q…² 2nd q…³ 3rd q…⁴ Numbe…⁵
* <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Male "Austrian" 20.4 13.8 17.4 23.5 1190627
2 Male "Non-Austrian " 15.3 10.6 13.2 16.6 353961
3 Male "EU-28/EWR/Switzerland (witho… 16.0 10.9 13.5 17.1 236607
4 Male "EU-15/EWR before 2004/Switze… 21.2 12.6 16.7 24.0 75300
5 Male "EU-13 EU Member States after… 13.6 10.3 12.7 15.1 161308
6 Male "Former Yugoslavia (without S… 14.1 11.2 13.6 16.1 56474
7 Male "Turkey" 13.3 10.4 12.8 15.4 22516
8 Male "Other" 14.0 9.07 11.2 14.6 38364
9 Female "Austrian" 15.9 11.1 14.0 18.3 889028
10 Female "Non-Austrian " 13.0 9.19 10.6 14.3 217322
11 Female "EU-28/EWR/Switzerland (witho… 13.7 9.38 11.3 15.5 143719
12 Female "EU-15/EWR before 2004/Switze… 17.1 11.0 14.7 19.7 47272
# … with 4 more rows, and abbreviated variable names ¹`Arithmetic mean`,
# ²`1st quartile`, ³`2nd quartile (median)`, ⁴`3rd quartile`,
# ⁵`Number of employees`
earnings$tabulate("Sex", "Region")
# A STATcubeR tibble: 18 x 7
Sex `Region (NUTS2)` `Arithmetic mean` 1st qua…¹ 2nd q…² 3rd q…³ Numbe…⁴
* <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Male AT11 Burgenland 16.6 11.6 14.2 18.1 35941
2 Male AT12 Lower Austria 17.8 12.3 15.4 19.9 257912
3 Male AT13 Vienna 20.9 12.2 16.7 24.8 353371
4 Male AT21 Carinthia 18.3 12.6 15.5 21.2 79189
5 Male AT22 Styria 18.4 12.8 15.9 21.1 219265
6 Male AT31 Upper Austria 19.7 13.7 17.1 22.5 295783
7 Male AT32 Salzburg 19.2 12.7 16.1 21.6 105792
8 Male AT33 Tyrol 18.6 12.8 16.2 21 128282
9 Male AT34 Vorarlberg 20.4 14.5 18.4 23.7 69053
10 Female AT11 Burgenland 14.5 10.2 12.6 16.7 29448
11 Female AT12 Lower Austria 14.2 10 12.7 16.1 164107
12 Female AT13 Vienna 17.1 10.6 14.6 20.2 280267
# … with 6 more rows, and abbreviated variable names ¹`1st quartile`,
# ²`2nd quartile (median)`, ³`3rd quartile`, ⁴`Number of employees`
earnings$tabulate("Citizenship", "Region")
# A STATcubeR tibble: 0 x 7
# … with 7 variables: Citizenship <fct>, Region (NUTS2) <fct>,
# Arithmetic mean <dbl>, 1st quartile <dbl>, 2nd quartile (median) <dbl>,
# 3rd quartile <dbl>, Number of employees <dbl>
We get an empty table because this cross tabulation is not included in the OGD dataset. The same will happen for Citizenship & Form of employment as well as Region & Form of employment.
#> [1] 0 7
#> [1] 0 7
Totals and the REST API
By default, STATcubeR will always add totals for datasets from the REST API and use those totals to aggregate the datasets.
x <- sc_table(sc_example("accomodation"))
x$meta$fields
# STATcubeR metadata: 3 x 7
code label total_code nitems type
<chr> <chr> <chr> <int> <chr>
1 C-SDB_TIT-0 Season/Tourism Month SC_TOTAL 273 Time (month)
2 C-C93SUM-0 Country of origin SC_TOTAL 4 Category
3 C-BBTR_REG2-0 Accomodation establishment SC_TOTAL 4 Category
# … with 2 more columns: 'label_de', 'label_en'
Including totals in the oputput
It is not necessary that all fields have totals. For example, suppose
we want to include the totals for Sex
in the output table.
We can just remove the toal code before running
sc_tabulate()
. The special symbol NA
can be
used to unset a total code.
earnings$total_codes(Sex = NA)
earnings$tabulate("Sex")
# A STATcubeR tibble: 3 x 6
Sex `Arithmetic mean` `1st quartile` 2nd quartile (med…¹ 3rd q…² Numbe…³
* <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sum total 17.6 11.6 15.1 20.1 2650938
2 Male 19.2 12.8 16.3 22.0 1544588
3 Female 15.3 10.4 13.4 17.6 1106350
# … with abbreviated variable names ¹`2nd quartile (median)`, ²`3rd quartile`,
# ³`Number of employees`
German Labels and Codes
It is possible to switch the language used for labeling the data.
This can be done by setting $language
to "de"
or "en"
.
earnings$language <- "de"
earnings$tabulate("Geschlecht")
# A STATcubeR tibble: 3 x 6
Geschlecht `Arithmetisches Mittel` `1. Quartil` 2. Quartil (…¹ 3. Qu…² Zahl …³
* <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 insgesamt 17.6 11.6 15.1 20.1 2650938
2 männlich 19.2 12.8 16.3 22.0 1544588
3 weiblich 15.3 10.4 13.4 17.6 1106350
# … with abbreviated variable names ¹`2. Quartil (Median)`, ²`3. Quartil`,
# ³`Zahl d unselbst Beschäftigten`
To skip labelling altogether and use variable codes in the output,
use raw=TRUE
.
earnings$tabulate("Geschlecht", raw = TRUE)
# A STATcubeR tibble: 3 x 6
`C-A11-0` `F-VESTE_AM` `F-VESTE_Q25` `F-VESTE_Q50` `F-VESTE_Q75` `F-VESTE_UB`
* <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A11-1 17.6 11.6 15.1 20.1 2650938
2 A11-2 19.2 12.8 16.3 22.0 1544588
3 A11-3 15.3 10.4 13.4 17.6 1106350
Switching languages is always available for od_table()
objects. For sc_table()
, it depends on whcih languages were
requested.
# default: get labels in german and english
x <- sc_table(sc_example("accomodation"))
# only get english labels
x <- sc_table(sc_example("accomodation"), lang = "en")
# only get german labels
x <- sc_table(sc_example("accomodation"), lang = "de")
Subsetting columns
In the previous examples, we only supplied names and/or codes of
fields to sc_tabulate()
. It is also possible to include
measures in which case the unlisted measures will be omitted.
earnings$tabulate("Geschlecht", "Arithmetisches Mittel", "2. Quartil")
# A STATcubeR tibble: 3 x 3
Geschlecht `Arithmetisches Mittel` `2. Quartil (Median)`
* <fct> <dbl> <dbl>
1 insgesamt 17.6 15.1
2 männlich 19.2 16.3
3 weiblich 15.3 13.4
Just like for fields, measures also support partial matching and
codes. In the above example, "2. Quartil"
was matched to
"2. Quartil (Median)"
.
Programmatic usage
Notice that we used the german label for the column
"Sex"
in the last calls to tabulate()
. This is
necessary because only the “active” labels are available to define the
tabulation. If you want to use
STATcubeR
programatically, always use codes to define the tabulation and also use
the .list
parameter if you want to pass several codes.
earnings$field("C-A11-0")
# STATcubeR metadata: 3 x 10
code label parsed
<chr> <chr> <chr>
1 A11-1 insgesamt insgesamt
2 A11-2 männlich männlich
3 A11-3 weiblich weiblich
# … with 7 more columns: 'label_de', 'label_en', 'parent', 'de_desc', 'en_desc', 'visible', 'order'
earnings$total_codes(`C-A11-0` = "A11-1")
vars_to_tabulate <- c("C-A11-0", "C-BESCHV-0")
earnings$tabulate(.list = vars_to_tabulate)
# A STATcubeR tibble: 12 x 7
Geschlecht Form des Beschäftigungsverhältnis…¹ Arith…² 1. Qu…³ 2. Qu…⁴ 3. Qu…⁵
* <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 männlich "Normalarbeitsverhältnis" 20.0 13.5 17.0 22.8
2 männlich "Atypische Beschäftigung (insgesam… 16.5 10.5 13.6 18.0
3 männlich "Atypisch: ausschließlich Teilzeit" 18.8 11.1 15.2 22.5
4 männlich "Atypisch: befristete Beschäftigun… 18.3 10.9 14.2 19.6
5 männlich "Atypisch: geringfügige Beschäftig… 12.6 9 10.2 12.6
6 männlich "Atypisch: Leih- und Zeitarbeit " 14.4 12 13.8 15.9
7 weiblich "Normalarbeitsverhältnis" 16.3 11.1 14.3 18.8
# … with 5 more rows, 1 more variable: `Zahl d unselbst Beschäftigten` <dbl>,
# and abbreviated variable names ¹`Form des Beschäftigungsverhältnisses`,
# ²`Arithmetisches Mittel`, ³`1. Quartil`, ⁴`2. Quartil (Median)`,
# ⁵`3. Quartil`
$total_codes()
currently uses an ellipsis
(...
) parameter to define total codes. In the future,
programmatic updates of sc_data
objects should be defined
in $recodes
. See #17.