Draw bootstrap replicates from survey data with a rotating panel design. Survey information such as IDs, sample weights, strata and population totals per stratum should be specified to ensure meaningful survey bootstrapping.
draw.bootstrap( dat, REP = 1000, hid = NULL, weights, period = NULL, strata = NULL, cluster = NULL, totals = NULL, single.PSU = c("merge", "mean"), boot.names = NULL, split = FALSE, pid = NULL, new.method = FALSE )
Argument | Description |
---|---|
dat | either data.frame or data.table containing the survey data with rotating panel design. |
REP | integer indicating the number of bootstrap replicates. |
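hid | character specifying the name of the column in dat containing the household ID. |
weights | character specifying the name of the column in dat containing the household sample weights. |
period | character specifying the name of the column in dat containing the sample period. |
strata | character vector specifying the name(s) of the column(s) in dat by which the population was stratified during the sampling process. |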
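cluster | character vector specifying cluster in the data. If not already specified in cluster, hid is used as the final cluster. |
totals | character specifying the name of the column in dat containing the number of PSUs at each stage. Optional for a single stage sampling design. |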
single.PSU | either "merge" or "mean", defining how single PSUs need to be dealt with. For |
boot.names | character indicating the leading string of the column names for each bootstrap replicate. If NULL, defaults to "w". |
split | logical, if TRUE split households are considered using pid. |
pid | column in dat containing the personal identifier. |
new.method | logical, if TRUE bootstrap replicates will never be negative even if in some strata the whole population is in the sample. WARNING: This is still experimental and resulting standard errors might be underestimated! Use this if for some strata the whole population is in the sample! |
Returns a data.table containing the original survey data together with REP additional columns, one for each bootstrap replicate.
The columns of the bootstrap replicates are by default labeled "wNumber", where Number goes from 1 to REP. If the column names of the bootstrap replicates should start with a different character or string, the parameter boot.names can be used.
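As a small illustration of this naming scheme, the following base R sketch builds the default replicate column names and selects them from a table. The toy data frame and the helper variables boot_cols and selected are assumptions made for illustration; the table only mimics the shape of the output and is not produced by draw.bootstrap.

```r
# Toy stand-in for the output of draw.bootstrap (all values are made up)
REP <- 3
boot.names <- "w"                       # default leading string
boot_cols <- paste0(boot.names, seq_len(REP))

toy <- data.frame(hid = c(1, 1, 2), pWeight = c(10, 10, 20))
for (col in boot_cols) {
  toy[[col]] <- toy$pWeight             # placeholder values, not real replicates
}

# Select the replicate columns: leading string followed by a number
selected <- grep(paste0("^", boot.names, "[0-9]+$"), names(toy), value = TRUE)
print(selected)
```

Matching on the leading string plus a number keeps other columns that merely start with the same letter out of the selection.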
draw.bootstrap takes dat and draws REP bootstrap replicates from it.
dat must be household data where household members correspond to multiple rows with the same household identifier. For most practical applications, the following columns should be available in the dataset and passed via the corresponding parameters:
- Column indicating the sample period (parameter period).
- Column indicating the household ID (parameter hid).
- Column containing the household sample weights (parameter weights).
- Columns by which the population was stratified during the sampling process (parameter strata).
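The expected layout can be pictured with a tiny made-up data set. The column names year, hid, pid, pWeight and region follow the naming used in the package examples, but the values are invented and the consistency check is only a sketch of something one might verify before bootstrapping.

```r
# Toy household panel: one row per person and period (all values are made up)
toy <- data.frame(
  year    = c(2010, 2010, 2010, 2011, 2011, 2011),
  hid     = c(1, 1, 2, 1, 1, 2),
  pid     = c(11, 12, 21, 11, 12, 21),
  pWeight = c(100, 100, 250, 100, 100, 250),
  region  = c("A", "A", "B", "A", "A", "B")
)

# Household members share the household ID, so a household weight should be
# constant within each household and period
n_weights <- aggregate(pWeight ~ hid + year, data = toy,
                       FUN = function(w) length(unique(w)))
weights_ok <- all(n_weights$pWeight == 1)
print(weights_ok)
```

In a real data set these columns would be passed as period = "year", hid = "hid", weights = "pWeight" and strata = "region".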
For a single stage sampling design the argument totals is optional, meaning that a column with the number of PSUs at the first stage does not need to be supplied.
In this case the number of PSUs is calculated from strata and weights and added to dat. Setting cluster to NULL always implies a single stage sampling design; if strata contains multiple column names, the combination of all those columns is used for stratification.
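One plausible way to picture this calculation is to sum the household weights within each stratum and period, counting every household once. This is an assumption made for illustration: the text only states that strata and weights are used, and the internal computation of the package may differ in detail. The column name N_households is hypothetical.

```r
# Toy data: household 1 has two members, households 2 and 3 one each
toy <- data.frame(
  year    = c(2010, 2010, 2010, 2010),
  hid     = c(1, 1, 2, 3),
  pWeight = c(100, 100, 250, 50),
  region  = c("A", "A", "A", "B")
)

# Keep one row per household and period so weights are not double counted
households <- toy[!duplicated(toy[, c("year", "hid")]), ]

# Estimated number of households (PSUs) per stratum and period
totals <- aggregate(pWeight ~ region + year, data = households, FUN = sum)
names(totals)[names(totals) == "pWeight"] <- "N_households"  # hypothetical name
print(totals)
```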
In the case of a multi stage sampling design the argument totals needs to be specified and must have the same number of elements as strata.
If cluster is NULL or does not contain hid at the last stage, hid will automatically be used as the final cluster. If, besides hid, clustering in additional stages is specified, the number of column names in strata and cluster (including hid) must be the same. If there was no clustering or stratification at some stage, one can set "1" or "I" for this stage.
For example, strata=c("REGION","I"), cluster=c("MUNICIPALITY","HID") would specify a 2 stage sampling design where at the first stage the municipalities were drawn stratified by region and at the second stage households are drawn in each municipality without stratification.
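The rules for multi stage designs can be sketched as a small consistency check. check_design is a hypothetical helper written only for this illustration and is not part of the package; the totals column names N_MUNICIPALITY and N_HID are likewise made up. The sketch covers the multi stage case only.

```r
# Hypothetical helper: checks that strata, cluster and totals describe the
# same number of stages, appending hid as the final cluster when it is missing
check_design <- function(strata, cluster, totals, hid) {
  if (is.null(cluster) || tail(cluster, 1) != hid) {
    cluster <- c(cluster, hid)
  }
  stopifnot(length(strata) == length(cluster),
            length(totals) == length(strata))
  data.frame(stage = seq_along(strata), strata = strata,
             cluster = cluster, totals = totals)
}

design <- check_design(
  strata  = c("REGION", "I"),            # "I": no stratification at stage 2
  cluster = "MUNICIPALITY",              # hid is appended as the last stage
  totals  = c("N_MUNICIPALITY", "N_HID"),
  hid     = "HID"
)
print(design)
```

Each row of the resulting table describes one sampling stage, which makes a mismatch between strata, cluster and totals easy to spot.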
Bootstrap replicates are drawn for each survey period (period) using the function rescaled.bootstrap.
Afterwards the bootstrap replicates for each household are carried forward from the first period the household enters the survey to all the consecutive periods it stays in the survey.
This ensures that the bootstrap replicates follow the same logic as the sampled households, making the bootstrap replicates more comparable to the actual sample units.
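The carry-forward step can be sketched in base R on a toy table. This is a simplified illustration, not the package's implementation: it treats the replicate as a single value per household and period, and the column w1_carried is a name invented here.

```r
# Toy replicate drawn independently in each period (values are made up)
toy <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2011),
  hid  = c(1, 2, 1, 2, 3),
  w1   = c(0, 2, 2, 0, 2)    # household 3 enters the survey in 2011
)

# Value of the replicate in the first period each household is observed
toy <- toy[order(toy$hid, toy$year), ]
first <- toy[!duplicated(toy$hid), c("hid", "w1")]

# Carry that value forward to all later periods of the same household
toy$w1_carried <- first$w1[match(toy$hid, first$hid)]
print(toy)
```

After this step every household has the same replicate value in all periods it is part of the survey, which mirrors how the household itself stays in the sample once selected.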
If split is set to TRUE and pid is specified, the bootstrap replicates are carried forward using the personal identifiers instead of the household identifier.
This takes into account the issue of a household splitting up.
Any person in this new split household will get the same bootstrap replicate as the person that has come from another household in the survey.
People who enter already existing households will also get the same bootstrap replicate as the other household members had in the previous periods.
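The following base R sketch walks through this logic for two periods. It is an illustration under simplifying assumptions (one replicate column, made-up values), not the code used by the package. Person 12 leaves household 1 and forms household 3 with a new person 31, while a new person 22 joins the existing household 2.

```r
# Period 1: replicate values per person (constant within household)
p1 <- data.frame(year = 2010, hid = c(1, 1, 2), pid = c(11, 12, 21),
                 w1 = c(0, 0, 2))
# Period 2: household 3 is a split household, person 22 is new in household 2
p2 <- data.frame(year = 2011, hid = c(1, 3, 3, 2, 2),
                 pid = c(11, 12, 31, 21, 22))

# Step 1: carry the replicate forward by person instead of by household
p2$w1 <- p1$w1[match(p2$pid, p1$pid)]

# Step 2: persons new to the survey take the value of the household members
# who were already in the survey
known <- tapply(p2$w1, p2$hid, function(x) x[!is.na(x)][1])
is_new <- is.na(p2$w1)
p2$w1[is_new] <- as.numeric(known[as.character(p2$hid[is_new])])
print(p2)
```

Person 31 ends up with the value person 12 brought along from household 1, and person 22 with the value household 2 already had.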
See data.table for more information on data.table objects.
Johannes Gussenbauer, Alexander Kowarik, Statistics Austria
if (FALSE) {
library(surveysd)

eusilc <- demo.eusilc(prettyNames = TRUE)

## draw sample without stratification or clustering
dat_boot <- draw.bootstrap(eusilc, REP = 10, weights = "pWeight",
                           period = "year")

## use stratification w.r.t. region and clustering w.r.t. households
dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = "region", period = "year")

## stratify by the combination of region and age,
## clustering w.r.t. households
dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = c("region", "age"), period = "year")

# create split households
eusilc[, pidsplit := pid]
year <- eusilc[, unique(year)]
year <- year[-1]
leaf_out <- c()
for (y in year) {
  # persons from the previous period whose ID is handed on
  split.person <- eusilc[
    year == (y - 1) & !duplicated(hid) & !(hid %in% leaf_out),
    sample(pid, 20)
  ]
  # persons in the current period who take over those IDs
  overwrite.person <- eusilc[
    (year == (y)) & !duplicated(hid) & !(hid %in% leaf_out),
    .(pid = sample(pid, 20))
  ]
  overwrite.person[, c("pidsplit", "year_curr") := .(split.person, y)]

  eusilc[overwrite.person, pidsplit := i.pidsplit,
         on = .(pid, year >= year_curr)]
  leaf_out <- c(leaf_out,
                eusilc[pid %in% c(overwrite.person$pid,
                                  overwrite.person$pidsplit),
                       unique(hid)])
}

dat_boot <- draw.bootstrap(
  eusilc, REP = 10, hid = "hid", weights = "pWeight",
  strata = c("region", "age"), period = "year", split = TRUE,
  pid = "pidsplit")

# split households were considered e.g. household and
# split household were both selected or not selected
dat_boot[, data.table::uniqueN(w1), by = pidsplit][V1 > 1]
}