For the most part, this document will present the functionalities of the function surveysd::calc.stError()
which generates point estimates and standard errors for user-supplied estimation functions.
In order to use a dataset with calc.stError()
, several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc()
and attach the bootstrap weights using draw.bootstrap()
and recalib()
. Please refer to the documentation of those functions for more detail.
library(surveysd)
set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]
## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
## year povertyRisk eqIncome onePerson pWeight w1 w2 w3
## 1: 2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 2: 2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 3: 2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992
## 4: 2011 FALSE 16090.69 FALSE 504.5696 0.4237016 968.2711 0.4468813
## 5: 2011 FALSE 16090.69 FALSE 504.5696 0.4237016 968.2711 0.4468813
## w4 w5
## 1: 0.4489062 1019.721
## 2: 0.4489062 1019.721
## 3: 0.4489062 1019.721
## 4: 0.4654705 1005.100
## 5: 0.4654705 1005.100
The parameters fun
and var
in calc.stError()
define the estimator to be used in the error analysis. There are two built-in estimator functions weightedSum()
and weightedRatio()
which can be used as follows.
povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)
Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.
povertyRate$Estimates
## year n N val_povertyRisk stE_povertyRisk
## 1: 2010 14827 8182222 14.44422 0.5838199
## 2: 2011 14827 8182222 14.77393 0.6771848
## 3: 2012 14827 8182222 15.04515 0.4907275
## 4: 2013 14827 8182222 14.89013 0.4756386
## 5: 2014 14827 8182222 15.14556 0.5195149
## 6: 2015 14827 8182222 15.53640 0.4808834
## 7: 2016 14827 8182222 15.08315 0.3566648
## 8: 2017 14827 8182222 15.42019 0.5592589
totalIncome$Estimates
## year n N val_eqIncome stE_eqIncome
## 1: 2010 14827 8182222 162750998071 960851034
## 2: 2011 14827 8182222 161926931417 901915958
## 3: 2012 14827 8182222 162576509628 1198230549
## 4: 2013 14827 8182222 163199507862 1474893389
## 5: 2014 14827 8182222 163986275009 1528829971
## 6: 2015 14827 8182222 163416275447 1338217239
## 7: 2016 14827 8182222 162706205137 1207314432
## 8: 2017 14827 8182222 164314959107 1322744977
Columns that use the val_
prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight
in case of the dataset used here.
Columns with the stE_
prefix denote standard errors calculated with bootstrap replicates. The replicates result in using w1
, w2
, …, w10
instead of pWeight
when applying the estimator.
n
denotes the number of observations for the year and N
denotes the total weight of those persons.
In order to define a custom estimator function to be used in fun
, the function needs to have two arguments like the example below.
## define custom estimator
myWeightedSum <- function(x, w) {
sum(x*w)
}
## check if results are equal to the one using `suveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)
## [1] TRUE
The parameters x
and w
can be assumed to be vectors with equal length with w
being numeric and x
being the column defined in the var
argument. It will be called once for each period (in this case year
) and for each weight column (in this case pWeight
, w1
, w2
, …, w10
).
In case an estimator should be applied to several columns of the dataset, var
can be set to a vector containing all necessary columns.
multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates
## year n N val_povertyRisk stE_povertyRisk val_onePerson
## 1: 2010 14827 8182222 14.44422 0.5838199 14.85737
## 2: 2011 14827 8182222 14.77393 0.6771848 14.85737
## 3: 2012 14827 8182222 15.04515 0.4907275 14.85737
## 4: 2013 14827 8182222 14.89013 0.4756386 14.85737
## 5: 2014 14827 8182222 15.14556 0.5195149 14.85737
## 6: 2015 14827 8182222 15.53640 0.4808834 14.85737
## 7: 2016 14827 8182222 15.08315 0.3566648 14.85737
## 8: 2017 14827 8182222 15.42019 0.5592589 14.85737
## stE_onePerson
## 1: 0.2572088
## 2: 0.2778282
## 3: 0.2909612
## 4: 0.3458258
## 5: 0.4394058
## 6: 0.3810024
## 7: 0.3105411
## 8: 0.3030293
Here we see the relative number of persons at risk of poverty and the relative number of one-person households.
The groups
argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in dat
. If set, all estimators are not only split by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.
dat2 <- subset(dat_boot_calib, year == 2010)
for (att in c("period", "weights", "b.rep"))
attr(dat2, att) <- attr(dat_boot_calib, att)
To calculate the ratio of persons at risk of poverty for each federal state of Austria, group = "region"
can be used.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
## year n N region val_povertyRisk stE_povertyRisk
## 1: 2010 549 260564 Burgenland 19.53984 1.8042764
## 2: 2010 733 377355 Vorarlberg 16.53731 3.2896891
## 3: 2010 924 535451 Salzburg 13.78734 2.2585866
## 4: 2010 1078 563648 Carinthia 13.08627 1.7469304
## 5: 2010 1317 701899 Tyrol 15.30819 1.3943715
## 6: 2010 2295 1167045 Styria 14.37464 1.3500431
## 7: 2010 2322 1598931 Vienna 17.23468 1.0428496
## 8: 2010 2804 1555709 Lower Austria 13.84362 1.7034375
## 9: 2010 2805 1421620 Upper Austria 10.88977 0.8709790
## 10: 2010 14827 8182222 <NA> 14.44422 0.5838199
The last row with region = NA
denotes the aggregate over all regions. Note that the columns N
and n
now show the weighted and unweighted number of persons in each region.
In case more than one grouping variable is used, there are several options of calling calc.stError()
depending on whether combinations of grouping levels should be regarded or not. We will consider the variables gender
and region
as our grouping variables and show three options on how calc.stError()
can be called.
Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore
\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]
The last row is again the estimate for the whole period.
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = c("gender", "region"))
povertyRates$Estimates
## year n N gender region val_povertyRisk stE_povertyRisk
## 1: 2010 549 260564 <NA> Burgenland 19.53984 1.8042764
## 2: 2010 733 377355 <NA> Vorarlberg 16.53731 3.2896891
## 3: 2010 924 535451 <NA> Salzburg 13.78734 2.2585866
## 4: 2010 1078 563648 <NA> Carinthia 13.08627 1.7469304
## 5: 2010 1317 701899 <NA> Tyrol 15.30819 1.3943715
## 6: 2010 2295 1167045 <NA> Styria 14.37464 1.3500431
## 7: 2010 2322 1598931 <NA> Vienna 17.23468 1.0428496
## 8: 2010 2804 1555709 <NA> Lower Austria 13.84362 1.7034375
## 9: 2010 2805 1421620 <NA> Upper Austria 10.88977 0.8709790
## 10: 2010 7267 3979572 male <NA> 12.02660 0.6303407
## 11: 2010 7560 4202650 female <NA> 16.73351 0.6300745
## 12: 2010 14827 8182222 <NA> <NA> 14.44422 0.5838199
state
and gender
Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size
\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = list(c("gender", "region")))
povertyRates$Estimates
## year n N gender region val_povertyRisk stE_povertyRisk
## 1: 2010 261 122741.8 male Burgenland 17.414524 2.2332286
## 2: 2010 288 137822.2 female Burgenland 21.432598 2.0899871
## 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.1093204
## 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.7576712
## 5: 2010 440 253143.7 male Salzburg 9.156964 1.9028612
## 6: 2010 484 282307.3 female Salzburg 17.939382 2.5374889
## 7: 2010 517 268581.4 male Carinthia 10.552149 2.0614209
## 8: 2010 561 295066.6 female Carinthia 15.392924 1.9756282
## 9: 2010 650 339566.5 male Tyrol 12.857542 1.0371712
## 10: 2010 667 362332.5 female Tyrol 17.604861 2.2145631
## 11: 2010 1128 571011.7 male Styria 11.671247 1.5111686
## 12: 2010 1132 774405.4 male Vienna 15.590616 1.3348673
## 13: 2010 1167 596033.3 female Styria 16.964539 1.3758548
## 14: 2010 1190 824525.6 female Vienna 18.778813 0.9304295
## 15: 2010 1363 684272.5 male Upper Austria 9.074690 1.1711956
## 16: 2010 1387 772593.2 female Lower Austria 16.372949 1.8366058
## 17: 2010 1417 783115.8 male Lower Austria 11.348283 1.6437512
## 18: 2010 1442 737347.5 female Upper Austria 12.574206 0.7710579
## 19: 2010 14827 8182222.0 <NA> <NA> 14.444218 0.5838199
In this case, the estimates and standard errors are calculated for
The number of rows in the output is therefore
\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]
povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates
## year n N gender region val_povertyRisk stE_povertyRisk
## 1: 2010 261 122741.8 male Burgenland 17.414524 2.2332286
## 2: 2010 288 137822.2 female Burgenland 21.432598 2.0899871
## 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.1093204
## 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.7576712
## 5: 2010 440 253143.7 male Salzburg 9.156964 1.9028612
## 6: 2010 484 282307.3 female Salzburg 17.939382 2.5374889
## 7: 2010 517 268581.4 male Carinthia 10.552149 2.0614209
## 8: 2010 549 260564.0 <NA> Burgenland 19.539837 1.8042764
## 9: 2010 561 295066.6 female Carinthia 15.392924 1.9756282
## 10: 2010 650 339566.5 male Tyrol 12.857542 1.0371712
## 11: 2010 667 362332.5 female Tyrol 17.604861 2.2145631
## 12: 2010 733 377355.0 <NA> Vorarlberg 16.537310 3.2896891
## 13: 2010 924 535451.0 <NA> Salzburg 13.787343 2.2585866
## 14: 2010 1078 563648.0 <NA> Carinthia 13.086268 1.7469304
## 15: 2010 1128 571011.7 male Styria 11.671247 1.5111686
## 16: 2010 1132 774405.4 male Vienna 15.590616 1.3348673
## 17: 2010 1167 596033.3 female Styria 16.964539 1.3758548
## 18: 2010 1190 824525.6 female Vienna 18.778813 0.9304295
## 19: 2010 1317 701899.0 <NA> Tyrol 15.308190 1.3943715
## 20: 2010 1363 684272.5 male Upper Austria 9.074690 1.1711956
## 21: 2010 1387 772593.2 female Lower Austria 16.372949 1.8366058
## 22: 2010 1417 783115.8 male Lower Austria 11.348283 1.6437512
## 23: 2010 1442 737347.5 female Upper Austria 12.574206 0.7710579
## 24: 2010 2295 1167045.0 <NA> Styria 14.374637 1.3500431
## 25: 2010 2322 1598931.0 <NA> Vienna 17.234683 1.0428496
## 26: 2010 2804 1555709.0 <NA> Lower Austria 13.843623 1.7034375
## 27: 2010 2805 1421620.0 <NA> Upper Austria 10.889773 0.8709790
## 28: 2010 7267 3979571.7 male <NA> 12.026600 0.6303407
## 29: 2010 7560 4202650.3 female <NA> 16.733508 0.6300745
## 30: 2010 14827 8182222.0 <NA> <NA> 14.444218 0.5838199
## year n N gender region val_povertyRisk stE_povertyRisk