For the most part, this document will present the functionalities of the function surveysd::calc.stError() which generates point estimates and standard errors for user-supplied estimation functions.

Prerequisites

In order to use a dataset with calc.stError(), several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc() and attach the bootstrap weights using draw.bootstrap() and recalib(). Please refer to the documentation of those functions for more detail.

library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
                          epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
##    year povertyRisk eqIncome onePerson  pWeight        w1        w2        w3
## 1: 2010       FALSE 16090.69     FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 2: 2010       FALSE 16090.69     FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 3: 2010       FALSE 16090.69     FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 4: 2011       FALSE 16090.69     FALSE 504.5696 0.4237016  968.2711 0.4468813
## 5: 2011       FALSE 16090.69     FALSE 504.5696 0.4237016  968.2711 0.4468813
##           w4       w5
## 1: 0.4489062 1019.721
## 2: 0.4489062 1019.721
## 3: 0.4489062 1019.721
## 4: 0.4654705 1005.100
## 5: 0.4654705 1005.100

Estimator functions

The parameters fun and var in calc.stError() define the estimator to be used in the error analysis. There are two built-in estimator functions weightedSum() and weightedRatio() which can be used as follows.

povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)

Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.

povertyRate$Estimates
##    year     n       N val_povertyRisk stE_povertyRisk
## 1: 2010 14827 8182222        14.44422       0.5838199
## 2: 2011 14827 8182222        14.77393       0.6771848
## 3: 2012 14827 8182222        15.04515       0.4907275
## 4: 2013 14827 8182222        14.89013       0.4756386
## 5: 2014 14827 8182222        15.14556       0.5195149
## 6: 2015 14827 8182222        15.53640       0.4808834
## 7: 2016 14827 8182222        15.08315       0.3566648
## 8: 2017 14827 8182222        15.42019       0.5592589
totalIncome$Estimates
##    year     n       N val_eqIncome stE_eqIncome
## 1: 2010 14827 8182222 162750998071    960851034
## 2: 2011 14827 8182222 161926931417    901915958
## 3: 2012 14827 8182222 162576509628   1198230549
## 4: 2013 14827 8182222 163199507862   1474893389
## 5: 2014 14827 8182222 163986275009   1528829971
## 6: 2015 14827 8182222 163416275447   1338217239
## 7: 2016 14827 8182222 162706205137   1207314432
## 8: 2017 14827 8182222 164314959107   1322744977

Columns that use the val_ prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight in case of the dataset used here.

Columns with the stE_ prefix denote standard errors calculated with bootstrap replicates. The replicates result in using w1, w2, …, w10 instead of pWeight when applying the estimator.

n denotes the number of observations for the year and N denotes the total weight of those persons.

Custom estimators

In order to define a custom estimator function to be used in fun, the function needs to have two arguments like the example below.

## define custom estimator
myWeightedSum <- function(x, w) {
  sum(x*w)
}

## check if results are equal to the one using `suveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)
## [1] TRUE

The parameters x and w can be assumed to be vectors with equal length with w being numeric and x being the column defined in the var argument. It will be called once for each period (in this case year) and for each weight column (in this case pWeight, w1, w2, …, w10).

Multiple estimators

In case an estimator should be applied to several columns of the dataset, var can be set to a vector containing all necessary columns.

multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates
##    year     n       N val_povertyRisk stE_povertyRisk val_onePerson
## 1: 2010 14827 8182222        14.44422       0.5838199      14.85737
## 2: 2011 14827 8182222        14.77393       0.6771848      14.85737
## 3: 2012 14827 8182222        15.04515       0.4907275      14.85737
## 4: 2013 14827 8182222        14.89013       0.4756386      14.85737
## 5: 2014 14827 8182222        15.14556       0.5195149      14.85737
## 6: 2015 14827 8182222        15.53640       0.4808834      14.85737
## 7: 2016 14827 8182222        15.08315       0.3566648      14.85737
## 8: 2017 14827 8182222        15.42019       0.5592589      14.85737
##    stE_onePerson
## 1:     0.2572088
## 2:     0.2778282
## 3:     0.2909612
## 4:     0.3458258
## 5:     0.4394058
## 6:     0.3810024
## 7:     0.3105411
## 8:     0.3030293

Here we see the relative number of persons at risk of poverty and the relative number of one-person households.

Grouping

The groups argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in dat. If set, all estimators are not only split by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.

dat2 <- subset(dat_boot_calib, year == 2010)
for (att  in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

To calculate the ratio of persons at risk of poverty for each federal state of Austria, group = "region" can be used.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
##     year     n       N        region val_povertyRisk stE_povertyRisk
##  1: 2010   549  260564    Burgenland        19.53984       1.8042764
##  2: 2010   733  377355    Vorarlberg        16.53731       3.2896891
##  3: 2010   924  535451      Salzburg        13.78734       2.2585866
##  4: 2010  1078  563648     Carinthia        13.08627       1.7469304
##  5: 2010  1317  701899         Tyrol        15.30819       1.3943715
##  6: 2010  2295 1167045        Styria        14.37464       1.3500431
##  7: 2010  2322 1598931        Vienna        17.23468       1.0428496
##  8: 2010  2804 1555709 Lower Austria        13.84362       1.7034375
##  9: 2010  2805 1421620 Upper Austria        10.88977       0.8709790
## 10: 2010 14827 8182222          <NA>        14.44422       0.5838199

The last row with region = NA denotes the aggregate over all regions. Note that the columns N and n now show the weighted and unweighted number of persons in each region.

Several grouping variables

In case more than one grouping variable is used, there are several options of calling calc.stError() depending on whether combinations of grouping levels should be regarded or not. We will consider the variables gender and region as our grouping variables and show three options on how calc.stError() can be called.

Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]

The last row is again the estimate for the whole period.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = c("gender", "region"))
povertyRates$Estimates
##     year     n       N gender        region val_povertyRisk stE_povertyRisk
##  1: 2010   549  260564   <NA>    Burgenland        19.53984       1.8042764
##  2: 2010   733  377355   <NA>    Vorarlberg        16.53731       3.2896891
##  3: 2010   924  535451   <NA>      Salzburg        13.78734       2.2585866
##  4: 2010  1078  563648   <NA>     Carinthia        13.08627       1.7469304
##  5: 2010  1317  701899   <NA>         Tyrol        15.30819       1.3943715
##  6: 2010  2295 1167045   <NA>        Styria        14.37464       1.3500431
##  7: 2010  2322 1598931   <NA>        Vienna        17.23468       1.0428496
##  8: 2010  2804 1555709   <NA> Lower Austria        13.84362       1.7034375
##  9: 2010  2805 1421620   <NA> Upper Austria        10.88977       0.8709790
## 10: 2010  7267 3979572   male          <NA>        12.02660       0.6303407
## 11: 2010  7560 4202650 female          <NA>        16.73351       0.6300745
## 12: 2010 14827 8182222   <NA>          <NA>        14.44422       0.5838199

Option 2: All combinations of state and gender

Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list(c("gender", "region")))
povertyRates$Estimates
##     year     n         N gender        region val_povertyRisk stE_povertyRisk
##  1: 2010   261  122741.8   male    Burgenland       17.414524       2.2332286
##  2: 2010   288  137822.2 female    Burgenland       21.432598       2.0899871
##  3: 2010   359  182732.9   male    Vorarlberg       12.973259       3.1093204
##  4: 2010   374  194622.1 female    Vorarlberg       19.883637       3.7576712
##  5: 2010   440  253143.7   male      Salzburg        9.156964       1.9028612
##  6: 2010   484  282307.3 female      Salzburg       17.939382       2.5374889
##  7: 2010   517  268581.4   male     Carinthia       10.552149       2.0614209
##  8: 2010   561  295066.6 female     Carinthia       15.392924       1.9756282
##  9: 2010   650  339566.5   male         Tyrol       12.857542       1.0371712
## 10: 2010   667  362332.5 female         Tyrol       17.604861       2.2145631
## 11: 2010  1128  571011.7   male        Styria       11.671247       1.5111686
## 12: 2010  1132  774405.4   male        Vienna       15.590616       1.3348673
## 13: 2010  1167  596033.3 female        Styria       16.964539       1.3758548
## 14: 2010  1190  824525.6 female        Vienna       18.778813       0.9304295
## 15: 2010  1363  684272.5   male Upper Austria        9.074690       1.1711956
## 16: 2010  1387  772593.2 female Lower Austria       16.372949       1.8366058
## 17: 2010  1417  783115.8   male Lower Austria       11.348283       1.6437512
## 18: 2010  1442  737347.5 female Upper Austria       12.574206       0.7710579
## 19: 2010 14827 8182222.0   <NA>          <NA>       14.444218       0.5838199

Option 3: Cobination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

  • every gender,
  • every state and
  • every combination of state and gender.

The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates
##     year     n         N gender        region val_povertyRisk stE_povertyRisk
##  1: 2010   261  122741.8   male    Burgenland       17.414524       2.2332286
##  2: 2010   288  137822.2 female    Burgenland       21.432598       2.0899871
##  3: 2010   359  182732.9   male    Vorarlberg       12.973259       3.1093204
##  4: 2010   374  194622.1 female    Vorarlberg       19.883637       3.7576712
##  5: 2010   440  253143.7   male      Salzburg        9.156964       1.9028612
##  6: 2010   484  282307.3 female      Salzburg       17.939382       2.5374889
##  7: 2010   517  268581.4   male     Carinthia       10.552149       2.0614209
##  8: 2010   549  260564.0   <NA>    Burgenland       19.539837       1.8042764
##  9: 2010   561  295066.6 female     Carinthia       15.392924       1.9756282
## 10: 2010   650  339566.5   male         Tyrol       12.857542       1.0371712
## 11: 2010   667  362332.5 female         Tyrol       17.604861       2.2145631
## 12: 2010   733  377355.0   <NA>    Vorarlberg       16.537310       3.2896891
## 13: 2010   924  535451.0   <NA>      Salzburg       13.787343       2.2585866
## 14: 2010  1078  563648.0   <NA>     Carinthia       13.086268       1.7469304
## 15: 2010  1128  571011.7   male        Styria       11.671247       1.5111686
## 16: 2010  1132  774405.4   male        Vienna       15.590616       1.3348673
## 17: 2010  1167  596033.3 female        Styria       16.964539       1.3758548
## 18: 2010  1190  824525.6 female        Vienna       18.778813       0.9304295
## 19: 2010  1317  701899.0   <NA>         Tyrol       15.308190       1.3943715
## 20: 2010  1363  684272.5   male Upper Austria        9.074690       1.1711956
## 21: 2010  1387  772593.2 female Lower Austria       16.372949       1.8366058
## 22: 2010  1417  783115.8   male Lower Austria       11.348283       1.6437512
## 23: 2010  1442  737347.5 female Upper Austria       12.574206       0.7710579
## 24: 2010  2295 1167045.0   <NA>        Styria       14.374637       1.3500431
## 25: 2010  2322 1598931.0   <NA>        Vienna       17.234683       1.0428496
## 26: 2010  2804 1555709.0   <NA> Lower Austria       13.843623       1.7034375
## 27: 2010  2805 1421620.0   <NA> Upper Austria       10.889773       0.8709790
## 28: 2010  7267 3979571.7   male          <NA>       12.026600       0.6303407
## 29: 2010  7560 4202650.3 female          <NA>       16.733508       0.6300745
## 30: 2010 14827 8182222.0   <NA>          <NA>       14.444218       0.5838199
##     year     n         N gender        region val_povertyRisk stE_povertyRisk