Calculate point estimates as well as standard errors of variables in surveys. Standard errors are estimated using bootstrap weights (see draw.bootstrap and recalib). In addition, the standard error of an estimate can be calculated using the survey data of 3 or more consecutive periods, which results in a reduction of the standard error.

calc.stError(
  dat,
  weights = attr(dat, "weights"),
  b.weights = attr(dat, "b.rep"),
  period = attr(dat, "period"),
  var,
  fun = weightedRatio,
  national = FALSE,
  group = NULL,
  fun.adjust.var = NULL,
  adjust.var = NULL,
  period.diff = NULL,
  period.mean = NULL,
  bias = FALSE,
  size.limit = 20,
  cv.limit = 10,
  p = NULL,
  add.arg = NULL
)

Arguments

dat

either data.frame or data.table containing the survey data. The survey can be a panel or rotating panel survey, but does not need to be. For rotating panel surveys bootstrap weights can be created using draw.bootstrap and recalib.

weights

character specifying the name of the column in dat containing the original sample weights. Used to calculate point estimates.

b.weights

character vector specifying the names of the columns in dat containing bootstrap weights. Used to calculate standard errors.

period

character specifying the name of the column in dat containing the sample periods.

var

character vector containing variable names in dat on which fun shall be applied for each sample period.

fun

function which will be applied on var for each sample period. Predefined functions are weightedRatio and weightedSum, but any other function can be supplied as long as it returns a double or integer and uses the weights as its second argument.

national

boolean, if TRUE point estimates resulting from fun will be divided by the point estimate at the national level.

group

character vector or list of character vectors containing variables in dat. For each list entry dat will be split into subgroups according to the contained variables as well as period. The point estimates are then calculated for each subgroup separately. If group = NULL the data will only be split into sample periods.

fun.adjust.var

can be either NULL or a function. This argument can be used to apply a function for each period and bootstrap weight to the data. The resulting estimates will be passed down to fun. See details for more explanations.

adjust.var

can be either NULL or a character specifying the first argument in fun.adjust.var.

period.diff

character vector defining periods for which the difference in the point estimate as well as its standard error is calculated. Each entry must have the form "period1 - period2". Can be NULL.

period.mean

odd integer defining the range of periods over which the sample mean of point estimates is additionally calculated.

bias

boolean, if TRUE the sample mean over the point estimates of the bootstrap weights is returned.

size.limit

integer defining a lower bound on the number of observations in dat in each group defined by period and the entries in group. Warnings are returned if the number of observations in a subgroup falls below size.limit. In addition the concerned groups are available in the function output.

cv.limit

non-negative value defining an upper bound for the standard error in relation to the point estimate. If this ratio exceeds cv.limit for a point estimate, it is flagged and available in the function output.

p

numeric vector containing values between 0 and 1. Defines which quantiles for the distribution of var are additionally estimated.

add.arg

additional arguments which will be passed to fun. Can be either a named list or vector. The names of the object correspond to the function arguments and the values to column names in dat, see also examples.

Value

Returns a list containing:

  • Estimates: data.table containing period differences and/or k period averages for estimates of fun applied to var as well as the corresponding standard errors, which are calculated using the bootstrap weights. In addition the sample size, n, and population size for each group are added to the output.

  • smallGroups: data.table containing groups for which the number of observations falls below size.limit.

  • cvHigh: data.table containing a boolean variable which indicates for each estimate if the estimated standard error exceeds cv.limit.

  • stEDecrease: data.table indicating for each estimate the theoretical increase in sample size which is gained when averaging over k periods. Only returned if period.mean is not NULL.

Details

calc.stError takes survey data (dat) and returns point estimates as well as their standard errors, defined by fun and var, for each sample period in dat. dat must be household data where household members correspond to multiple rows with the same household identifier. The data should at least contain the following columns:

  • Column indicating the sample period;

  • Column indicating the household ID;

  • Column containing the household sample weights;

  • Columns which contain the bootstrap weights (see output of recalib);

  • Columns listed in var as well as in group.

For each variable in var as well as each sample period the function fun is applied using the original as well as the bootstrap sample weights.
The point estimate is then taken as the result of fun using the original sample weights and its standard error is estimated from the results of fun using the bootstrap sample weights.

fun can be any function which returns a double or integer and uses sample weights as its second argument. The predefined options are weightedRatio and weightedSum.
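
For illustration, a user-supplied fun only needs to follow this signature; the helper wMean below is a sketch and not part of the package:

# values as first argument, sample weights as second, returning a single number
wMean <- function(x, w) {
  sum(x * w) / sum(w)
}
# hypothetical call, using objects as created in the Examples section
# err.est <- calc.stError(dat_boot_calib, var = "eqIncome", fun = wMean)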

For the option weightedRatio a weighted ratio (in %) is calculated for var equal to 1, e.g. sum(weight[var==1])/sum(weight[!is.na(var)])*100.
Additionally, using the option national=TRUE, the weighted ratio (in %) is divided by the weighted ratio at the national level for each period.
If group is not NULL but a vector of variables from dat then fun is applied on each subset of dat defined by all combinations of values in group.
For instance, if group = "sex" with "sex" taking the values "Male" and "Female" in dat, the point estimate and standard error are calculated on the subsets of dat with only "Male" or only "Female" as value for "sex". This is done for each value of period. For variables in group which have NAs in dat the rows containing the missing values are discarded.
When group is a list of character vectors, the subsetting of dat and the subsequent estimation of the point estimate, including the estimate for the standard error, are carried out for each list entry, as sketched below.
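
As a brief sketch (the columns gender and region come from the demo data used in the Examples; the object names are only illustrative), the two forms of group differ as follows:

# a single character vector: one grouping, i.e. estimates per period x gender x region
group_single <- c("gender", "region")
# a list: each entry is its own grouping and is evaluated separately
group_list <- list("gender", "region", c("gender", "region"))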

The optional parameters fun.adjust.var and adjust.var can be used if the values in var depend on the weights, as is for instance the case for the poverty threshold calculated from EU-SILC. In such a case an additional function can be supplied using fun.adjust.var as well as its first argument adjust.var, which needs to be part of the data set dat. Then, before applying fun on variable var for all periods and groups, the function fun.adjust.var is applied to adjust.var using each of the bootstrap weights separately (NOTE: the weight is used as the second argument of fun.adjust.var), thus creating i=1,...,length(b.weights) additional variables. When applying fun on var, the estimates for the bootstrap replicates will then use each of the corresponding new additional variables. So instead of $$fun(var,weights,...),fun(var,b.weights[1],...), fun(var,b.weights[2],...),...$$ the function fun will be applied in the way $$fun(var,weights,...),fun(var.1,b.weights[1],...),fun(var.2, b.weights[2],...),...$$

where var.1, var.2, ... correspond to the estimates resulting from fun.adjust.var and adjust.var. NOTE: This procedure is especially useful if var depends on the weights and fun is applied on subgroups of the data set. In that case the procedure cannot be captured with fun and var alone; see the examples for a more hands-on explanation.
When defining period.diff the differences of point estimates between periods as well as their standard errors are calculated.
The entries in period.diff must have the form "period1 - period2", which means that the point estimates for period2 will be subtracted from the point estimates for period1.
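
For example (reusing the object and column names from the Examples section), the change from 2011 to 2012 could be requested as follows:

# point estimate and standard error of estimate(2012) - estimate(2011)
period.diff <- c("2012-2011")
err.diff <- calc.stError(dat_boot_calib, var = "povertyRisk",
                         fun = weightedRatio, period.diff = period.diff)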

Specifying period.mean leads to an improvement of the standard error by averaging the results for the point estimates, using the bootstrap weights, over period.mean periods. Setting, for instance, period.mean = 3 results in averaging these results over each consecutive set of 3 periods.
Estimating the standard error over these averages gives an improved estimate of the standard error for the central period which was used for averaging.
The averaging of the results is also applied to differences of point estimates. For instance, defining period.diff = "2015-2009" and period.mean = 3, the differences in point estimates of 2015 and 2009, 2016 and 2010 as well as 2014 and 2008 are calculated and finally the average over these 3 differences is taken. The periods set in period.diff are always used as the middle periods around which the mean over period.mean years is built.
Setting bias to TRUE additionally returns the mean over the results from the bootstrap replicates. In the output the corresponding columns are labeled with the suffix _mean.

If fun needs more arguments they can be supplied in add.arg. This can either be a named list or vector.
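
A short sketch of this mechanism (the helper fun_b and its argument b are illustrative; the column onePerson is created in the Examples section): names in add.arg refer to arguments of fun, values refer to column names in dat.

# fun_b takes an extra argument b; add.arg maps b to the column "onePerson"
fun_b <- function(x, w, b) {
  sum(x * w * b)
}
add.arg <- list(b = "onePerson")
# err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun_b,
#                         add.arg = add.arg)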

The parameter size.limit indicates a lower bound of the sample size for subsets in dat created by group. If the sample size of a subset falls below size.limit a warning will be displayed.
In addition all subsets for which this is the case can be selected from the output of calc.stError with $smallGroups.
With the parameter cv.limit one can set an upper bound on the coefficient of variation. Estimates which exceed this bound are flagged with TRUE and are available in the function output with $cvHigh. cv.limit must be a positive integer and is treated internally as %; e.g. for cv.limit=1 the estimate will be flagged if the coefficient of variation exceeds 1%.
When specifying period.mean, the decrease in standard error achieved by this method is internally calculated and a rough estimate of the implied increase in sample size is available in the output with $stEDecrease. The rough estimate for the increase in sample size uses the fact that for a sample of size \(n\) the sample estimate for the standard error of most point estimates converges with a factor \(1/\sqrt{n}\) against the true standard error \(\sigma\).
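
As a rough back-of-the-envelope sketch of this argument (illustrative numbers, not the package's internal computation): because the standard error scales roughly with \(1/\sqrt{n}\), the ratio of squared standard errors gives the implied relative increase in sample size.

# hypothetical standard errors of a single-period estimate and of the
# corresponding k-period average
stE_single <- 1.05
stE_mean   <- 0.70
# implied factor by which the sample size would have to grow to reach the
# same precision without averaging
(stE_single / stE_mean)^2
#> [1] 2.25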

See also

draw.bootstrap, recalib

Author

Johannes Gussenbauer, Alexander Kowarik, Statistics Austria

Examples

# Import data and calibrate
set.seed(1234)
eusilc <- demo.eusilc(n = 4, prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 3, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region")
#> Iteration stopped after 2 steps
#> Convergence reached
#> Iteration stopped after 3 steps
#> Convergence reached
#> Iteration stopped after 3 steps
#> Convergence reached
# estimate weightedRatio for povertyRisk per period
err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
err.est$Estimates
#> year n N val_povertyRisk stE_povertyRisk
#> 1: 2010 14827 8182222 14.44422 1.0561405
#> 2: 2011 14827 8182222 14.77393 1.0159153
#> 3: 2012 14827 8182222 15.04515 1.1458727
#> 4: 2013 14827 8182222 14.89013 0.9472434
# calculate weightedRatio for povertyRisk and fraction of one-person
# households per period
dat_boot_calib[, onePerson := .N == 1, by = .(year, hid)]
#> hid hsize region pid age gender ecoStat citizenship
#> 1: 1 (2,3] Tyrol 101 (25,45] female part time AT
#> 2: 1 (2,3] Tyrol 102 (25,45] male full time Other
#> 3: 1 (2,3] Tyrol 103 (-Inf,16] male <NA> <NA>
#> 4: 1 (2,3] Tyrol 101 (25,45] female part time AT
#> 5: 1 (2,3] Tyrol 102 (25,45] male full time Other
#> ---
#> 59304: 10499 (3,4] Lower Austria 1049901 (45,65] male full time AT
#> 59305: 10499 (3,4] Lower Austria 1049902 (45,65] female domestic AT
#> 59306: 10499 (3,4] Lower Austria 1049903 (25,45] male full time AT
#> 59307: 10499 (3,4] Lower Austria 1049904 (16,25] female domestic AT
#> 59308: 10500 (0,1] Upper Austria 1050001 (25,45] female full time AT
#> py010n py050n py090n py100n py110n py120n py130n py140n hy040n
#> 1: 9756.25 0 0.00 0 0 0 0.00 0 4273.9
#> 2: 12471.60 0 0.00 0 0 0 0.00 0 4273.9
#> 3: NA NA NA NA NA NA NA NA 4273.9
#> 4: 9756.25 0 0.00 0 0 0 0.00 0 4273.9
#> 5: 12471.60 0 0.00 0 0 0 0.00 0 4273.9
#> ---
#> 59304: 22534.03 0 0.00 0 0 0 3023.79 0 0.0
#> 59305: 0.00 0 0.00 0 0 0 0.00 0 0.0
#> 59306: 0.00 0 5848.37 0 0 0 0.00 0 0.0
#> 59307: 0.00 0 3737.27 0 0 0 0.00 0 0.0
#> 59308: 13962.56 0 0.00 0 0 0 0.00 0 0.0
#> hy050n hy070n hy080n hy090n hy110n hy130n hy145n eqSS eqIncome
#> 1: 2428.11 0 0 33.39 0 0 0 1.8 16090.694
#> 2: 2428.11 0 0 33.39 0 0 0 1.8 16090.694
#> 3: 2428.11 0 0 33.39 0 0 0 1.8 16090.694
#> 4: 2428.11 0 0 33.39 0 0 0 1.8 16090.694
#> 5: 2428.11 0 0 33.39 0 0 0 1.8 16090.694
#> ---
#> 59304: 0.00 0 0 361.35 0 0 0 2.5 20360.440
#> 59305: 0.00 0 0 361.35 0 0 0 2.5 20360.440
#> 59306: 0.00 0 0 361.35 0 0 0 2.5 20360.440
#> 59307: 0.00 0 0 361.35 0 0 0 2.5 20360.440
#> 59308: 0.00 0 0 424.85 0 0 0 1.0 6923.625
#> db090 pWeight year povertyRisk w1 w2 w3
#> 1: 504.5696 504.5696 2010 FALSE 0.4451532 0.4412852 998.4909445
#> 2: 504.5696 504.5696 2010 FALSE 0.4451532 0.4412852 998.4909445
#> 3: 504.5696 504.5696 2010 FALSE 0.4451532 0.4412852 998.4909445
#> 4: 504.5696 504.5696 2011 FALSE 0.4342066 0.4257681 983.7207754
#> 5: 504.5696 504.5696 2011 FALSE 0.4342066 0.4257681 983.7207754
#> ---
#> 59304: 556.4260 556.4260 2013 FALSE 1.0081397 1.0156619 0.9745348
#> 59305: 556.4260 556.4260 2013 FALSE 1.0081397 1.0156619 0.9745348
#> 59306: 556.4260 556.4260 2013 FALSE 1.0081397 1.0156619 0.9745348
#> 59307: 556.4260 556.4260 2013 FALSE 1.0081397 1.0156619 0.9745348
#> 59308: 643.2557 643.2557 2013 TRUE 1252.9016300 0.6142391 0.6007900
#> onePerson
#> 1: FALSE
#> 2: FALSE
#> 3: FALSE
#> 4: FALSE
#> 5: FALSE
#> ---
#> 59304: FALSE
#> 59305: FALSE
#> 59306: FALSE
#> 59307: FALSE
#> 59308: TRUE
err.est <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"),
                        fun = weightedRatio)
err.est$Estimates
#> year n N val_povertyRisk stE_povertyRisk val_onePerson
#> 1: 2010 14827 8182222 14.44422 1.0561405 14.85737
#> 2: 2011 14827 8182222 14.77393 1.0159153 14.85737
#> 3: 2012 14827 8182222 15.04515 1.1458727 14.85737
#> 4: 2013 14827 8182222 14.89013 0.9472434 14.85737
#> stE_onePerson
#> 1: 0.50178225
#> 2: 0.29752306
#> 3: 0.24021487
#> 4: 0.03895766
# estimate weightedRatio for povertyRisk per period and gender
group <- "gender"
err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
                        group = group)
err.est$Estimates
#> year n N gender val_povertyRisk stE_povertyRisk
#> 1: 2010 7267 3979572 male 12.02660 1.2626307
#> 2: 2010 7560 4202650 female 16.73351 0.9334663
#> 3: 2010 14827 8182222 <NA> 14.44422 1.0561405
#> 4: 2011 7267 3979572 male 12.81921 1.0552081
#> 5: 2011 7560 4202650 female 16.62488 0.9802308
#> 6: 2011 14827 8182222 <NA> 14.77393 1.0159153
#> 7: 2012 7267 3979572 male 13.76065 1.3728750
#> 8: 2012 7560 4202650 female 16.26147 0.9257788
#> 9: 2012 14827 8182222 <NA> 15.04515 1.1458727
#> 10: 2013 7267 3979572 male 13.88962 1.0523561
#> 11: 2013 7560 4202650 female 15.83754 0.8533355
#> 12: 2013 14827 8182222 <NA> 14.89013 0.9472434
# estimate weightedRatio for povertyRisk per period and gender, region and
# combination of both
group <- list("gender", "region", c("gender", "region"))
err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
                        group = group)
err.est$Estimates
#> year n N gender region val_povertyRisk stE_povertyRisk
#> 1: 2010 261 122741.8 male Burgenland 17.414524 1.8440173
#> 2: 2010 288 137822.2 female Burgenland 21.432598 1.5084922
#> 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.2756373
#> 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.2976678
#> 5: 2010 440 253143.7 male Salzburg 9.156964 2.0748992
#> ---
#> 116: 2013 2804 1555709.0 <NA> Lower Austria 14.340485 0.2188551
#> 117: 2013 2805 1421620.0 <NA> Upper Austria 14.400780 1.5636545
#> 118: 2013 7267 3979571.7 male <NA> 13.889623 1.0523561
#> 119: 2013 7560 4202650.3 female <NA> 15.837536 0.8533355
#> 120: 2013 14827 8182222.0 <NA> <NA> 14.890134 0.9472434
# use average over 3 periods for standard error estimation
err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
                        period.mean = 3)
err.est$Estimates
#> year n N val_povertyRisk stE_povertyRisk
#> 1: 2010 14827 8182222 14.44422 1.0561405
#> 2: 2010_2011_2012 14827 8182222 14.75443 1.0655681
#> 3: 2011 14827 8182222 14.77393 1.0159153
#> 4: 2011_2012_2013 14827 8182222 14.90307 1.0320904
#> 5: 2012 14827 8182222 15.04515 1.1458727
#> 6: 2013 14827 8182222 14.89013 0.9472434
# get estimate for difference of period 2011 and 2012
period.diff <- c("2012-2011")
err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
  period.diff = period.diff, period.mean = 3)
err.est$Estimates
#> year n N val_povertyRisk stE_povertyRisk
#> 1: 2010 14827 8182222 14.4442182 1.05614046
#> 2: 2010_2011_2012 14827 8182222 14.7544308 1.06556810
#> 3: 2011 14827 8182222 14.7739255 1.01591526
#> 4: 2011_2012_2013 14827 8182222 14.9030692 1.03209037
#> 5: 2012 14827 8182222 15.0451487 1.14587269
#> 6: 2012-2011 14827 8182222 0.2712233 0.24052436
#> 7: 2012-2011_mean 14827 8182222 0.1486385 0.04551717
#> 8: 2013 14827 8182222 14.8901335 0.94724335
# use add.arg-argument
fun <- function(x, w, b) {
  sum(x*w*b)
}
add.arg = list(b="onePerson")

err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun,
                        period.mean = 0, add.arg=add.arg)
err.est$Estimates
#> year n N val_povertyRisk stE_povertyRisk
#> 1: 2010 14827 8182222 273683.9 6134.092
#> 2: 2011 14827 8182222 261883.6 6785.949
#> 3: 2012 14827 8182222 243083.9 3261.701
#> 4: 2013 14827 8182222 238004.4 16875.434
# compare with direct computation
compare.value <- dat_boot_calib[, fun(povertyRisk, pWeight, b = onePerson),
                                by = c("year")]
all((compare.value$V1 - err.est$Estimates$val_povertyRisk) == 0)
#> [1] TRUE
# use a function from another package that has sampling weights as its
# second argument
# for example gini() from laeken
library(laeken)

## set up help function that returns only the gini index
help_gini <- function(x, w) {
  return(gini(x, w)$value)
}

## make sure povertyRisk gets coerced to a numeric in order to work with the
## external functions
invisible(dat_boot_calib[, povertyRisk := as.numeric(povertyRisk)])

err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = help_gini, group = group,
  period.diff = period.diff, period.mean = 3)
err.est$Estimates
#> year n N gender region val_povertyRisk stE_povertyRisk
#> 1: 2010 261 122741.8 male Burgenland 82.58548 1.8440173
#> 2: 2010 288 137822.2 female Burgenland 78.56740 1.5084922
#> 3: 2010 359 182732.9 male Vorarlberg 87.02674 3.2756373
#> 4: 2010 374 194622.1 female Vorarlberg 80.11636 3.2976678
#> 5: 2010 440 253143.7 male Salzburg 90.84304 2.0748992
#> ---
#> 236: 2013 2804 1555709.0 <NA> Lower Austria 85.65952 0.2188551
#> 237: 2013 2805 1421620.0 <NA> Upper Austria 85.59922 1.5636545
#> 238: 2013 7267 3979571.7 male <NA> 86.11038 1.0523561
#> 239: 2013 7560 4202650.3 female <NA> 84.16246 0.8533355
#> 240: 2013 14827 8182222.0 <NA> <NA> 85.10987 0.9472434
# using fun.adjust.var and adjust.var to estimate povmd60 indicator
# for each period and bootstrap weight before applying the weightedRatio
# point estimate

# this function estimates the povmd60 indicator with x as income vector
# and w as weight vector
povmd <- function(x, w) {
  md <- laeken::weightedMedian(x, w)*0.6
  pmd60 <- x < md
  return(as.integer(pmd60))
}

# set adjust.var = "eqIncome" so the income vector is used to estimate
# the povmd60 indicator for each bootstrap weight
# and the resulting indicators are passed to function weightedRatio
err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = group,
  fun.adjust.var = povmd, adjust.var = "eqIncome", period.mean = 3)
err.est$Estimates
#> year n N gender region val_povertyRisk stE_povertyRisk
#> 1: 2010 261 122741.8 male Burgenland 17.414524 1.844017
#> 2: 2010 288 137822.2 female Burgenland 21.432598 1.252953
#> 3: 2010 359 182732.9 male Vorarlberg 12.973259 3.275637
#> 4: 2010 374 194622.1 female Vorarlberg 19.883637 3.297668
#> 5: 2010 440 253143.7 male Salzburg 9.156964 2.074899
#> ---
#> 176: 2013 2804 1555709.0 <NA> Lower Austria 14.340485 0.454558
#> 177: 2013 2805 1421620.0 <NA> Upper Austria 14.400780 0.907734
#> 178: 2013 7267 3979571.7 male <NA> 13.889623 1.115469
#> 179: 2013 7560 4202650.3 female <NA> 15.837536 0.973553
#> 180: 2013 14827 8182222.0 <NA> <NA> 14.890134 1.036791
# why fun.adjust.var and adjust.var are needed (!!!):
# one could also use the following function
# and set fun.adjust.var = NULL, adjust.var = NULL
# and set fun = povmd2, var = "eqIncome"
povmd2 <- function(x, w) {
  md <- laeken::weightedMedian(x, w)*0.6
  pmd60 <- x < md
  # weighted ratio is directly estimated inside my function
  return(sum(w[pmd60])/sum(w)*100)
}

# but this results in different results in subgroups
# compared to using fun.adjust.var and adjust.var
err.est.different <- calc.stError(
  dat_boot_calib, var = "eqIncome", fun = povmd2, group = group,
  fun.adjust.var = NULL, adjust.var = NULL, period.mean = 3)
err.est.different$Estimates
#> year n N gender region val_eqIncome stE_eqIncome
#> 1: 2010 261 122741.8 male Burgenland 18.61871 1.9754854
#> 2: 2010 288 137822.2 female Burgenland 18.12804 3.4270516
#> 3: 2010 359 182732.9 male Vorarlberg 14.10553 3.1759133
#> 4: 2010 374 194622.1 female Vorarlberg 19.34374 4.6825885
#> 5: 2010 440 253143.7 male Salzburg 11.71768 3.2418638
#> ---
#> 176: 2013 2804 1555709.0 <NA> Lower Austria 14.82065 0.5475655
#> 177: 2013 2805 1421620.0 <NA> Upper Austria 14.51550 0.5578348
#> 178: 2013 7267 3979571.7 male <NA> 15.23198 1.0568032
#> 179: 2013 7560 4202650.3 female <NA> 14.88139 1.1991456
#> 180: 2013 14827 8182222.0 <NA> <NA> 14.89013 1.0367907
## results are equal for yearly estimates
all.equal(err.est.different$Estimates[is.na(gender) & is.na(region)],
          err.est$Estimates[is.na(gender) & is.na(region)],
          check.attributes = FALSE)
#> [1] TRUE
## but for subgroups (gender, region) results vary
all.equal(err.est.different$Estimates[!(is.na(gender) & is.na(region))],
          err.est$Estimates[!(is.na(gender) & is.na(region))],
          check.attributes = FALSE)
#> [1] "Column 'val_eqIncome': Mean relative difference: 0.08000699"