Draw bootstrap replicates from survey data with rotating panel design. Survey information, like ID, sample weights, strata and population totals per strata, should be specified to ensure meaningfull survey bootstraping.
Usage
draw.bootstrap(
dat,
method = "Preston",
REP = 1000,
hid = NULL,
weights,
period = NULL,
strata = NULL,
cluster = NULL,
totals = NULL,
single.PSU = c("merge", "mean"),
boot.names = NULL,
split = FALSE,
pid = NULL,
seed = NULL,
already.selected = NULL
)
Arguments
- dat
either data.frame or data.table containing the survey data with rotating panel design.
- method
for bootstrap replicates, either
"Preston"
or"Rao-Wu"
- REP
integer indicating the number of bootstrap replicates.
- hid
character specifying the name of the column in
dat
containing the household id. IfNULL
(the default), the household structure is not regarded.- weights
character specifying the name of the column in
dat
containing the sample weights.- period
character specifying the name of the column in
dat
containing the sample periods. IfNULL
(the default), it is assumed that all observations belong to the same period.- strata
character vector specifying the name(s) of the column in
dat
by which the population was stratified. Ifstrata
is a vector stratification will be assumed as the combination of column names contained instrata
. Setting in additioncluster
not NULL stratification will be assumed on multiple stages, where each additional entry instrata
specifies the stratification variable for the next lower stage. see Details for more information.- cluster
character vector specifying cluster in the data. If not already specified in
cluster
household ID is taken es the lowest level cluster.- totals
character specifying the name of the column in
dat
containing the the totals per strata and/or cluster. Is ONLY optional ifcluster
isNULL
or equalhid
andstrata
contains one columnname! Then the households per strata will be calcualted using theweights
argument. If clusters and strata for multiple stages are specifiedtotals
needs to be a vector oflength(strata)
specifying the column ondat
that contain the total number of PSUs at each stage.totals
is interpreted from left the right, meaning that the first argument corresponds to the number of PSUs at the first and the last argument to the number of PSUs at the last stage.- single.PSU
either "merge" or "mean" defining how single PSUs need to be dealt with. For
single.PSU="merge"
single PSUs at each stage are merged with the strata or cluster with the next least number of PSUs. If multiple of those exist one will be select via random draw. Forsingle.PSU="mean"
single PSUs will get the mean over all bootstrap replicates at the stage which did not contain single PSUs.- boot.names
character indicating the leading string of the column names for each bootstrap replica. If NULL defaults to "w".
- split
logical, if TRUE split households are considered using
pid
, for more information see Details.- pid
column in
dat
specifying the personal identifier. This identifier needs to be unique for each person throught the whole data set. Can beNULL
.- seed
integer specifying the seed for the random number generator.
- already.selected
list of data.tables indicating if record was drawn in previous
period
.length(already.selected)
must be equal to the number of sampling stages specified. Seeget.selection()
for an example.
Value
the survey data with the number of REP bootstrap replicates added as columns.
Returns a data.table containing the original data as well as the
number of REP
columns containing the bootstrap replicates for each
repetition.
The columns of the bootstrap replicates are by default labeled "wNumber"
where Number goes from 1 to REP
. If the column names of the bootstrap
replicates should start with a different character or string the parameter
boot.names
can be used.
Details
draw.bootstrap
takes dat
and draws REP
bootstrap replicates
from it.
dat
must be household data where household members correspond to multiple
rows with the same household
identifier. For most practical applications, the following columns should be
available in the dataset
and passed via the corresponding parameters:
Column indicating the sample period (parameter
period
).Column indicating the household ID (parameter
hid
)Column containing the household sample weights (parameter
weights
);Columns by which population was stratified during the sampling process (parameter:
strata
).
As methods either the either the rescaled bootstrap for stratified
multistage sampling, presented by J. Preston (2009) (method = "Preston"
)
or the Rao-Wu boostrap by J. N. K. Rao and C. F. J. Wu (1988) (method = "Rao-Wu"
)
are supported.
For single stage sampling design a column the argument totals
is optional,
meaning that a column of the number of PSUs at the first stage does not need
to be supplied. The number of PSUs is calculated and added to dat
using
strata
and weights
. By setting cluster
to NULL single stage sampling
design is always assumed and
if strata
contains of multiple column names the combination of all those
column names will be used for stratification.
In the case of multi stage sampling design the argument totals
needs to be
specified and needs to have the same number of arguments as strata
.
If cluster
is NULL
or does not contain hid
at the last stage, hid
will automatically be used as the final cluster. If, besides hid
,
clustering in additional stages is specified the number of column names in
strata
and cluster
(including hid
) must be the same. If for any stage
there was no clustering or stratification one can set "1" or "I" for this
stage.
For example strata=c("REGION","I"),cluster=c("MUNICIPALITY","HID")
would
speficy a 2 stage sampling design where at the first stage the municipalities
where drawn stratified by regions
and at the 2nd stage housholds are drawn in each municipality without
stratification.
Bootstrap replicates are drawn for each survey period consecutively (period
) using the
function rescaled.bootstrap.
Bootstrap replicates are drawn consistently in the way that in each period
and
sampling stage always \(\floor{n/2}\) clusters are selected in each strata.
This ensures that the bootstrap replicates follow the same logic as the
sampled households, making the bootstrap replicates more comparable to the
actual sample units.
If split
ist set to TRUE
and pid
is specified, the bootstrap replicates
are carried forward using the personal identifiers instead of the household
identifier.
This takes into account the issue of a houshold splitting up.
Any person in this new split household will get the same bootstrap replicate
as the person that has come from an other household in the survey.
People who enter already existing households will also get the same bootstrap
replicate as the other households members had in the previous periods.
already.selected
can be specified in order to construct bootstrap replicates
considering that already drawn bootstrap replicates from a previous period
exist for the same survey. See get.selection()
for more explanations and examples.
References
Preston, J. (2009). Rescaled bootstrap for stratified multistage sampling. Survey Methodology. 35. 227-234. Rao, J. N. K., and C. F. J. Wu. (1988). Resampling Inference with Complex Survey Data. Journal of the American Statistical Association 83 (401): 231–41.
See also
data.table
for more information on
data.table objects.
Examples
library(surveysd)
library(data.table)
setDTthreads(1)
set.seed(1234)
eusilc <- demo.eusilc(n = 3, prettyNames = TRUE)
## draw replicates without stratification or clustering
dat_boot <- draw.bootstrap(eusilc, REP = 1, weights = "pWeight",
period = "year")
## use stratification w.r.t. region and clustering w.r.t. households
dat_boot <- draw.bootstrap(
eusilc, REP = 1, hid = "hid", weights = "pWeight",
strata = "region", period = "year")
## use multi-level clustering
dat_boot <- draw.bootstrap(
eusilc, REP = 1, hid = "hid", weights = "pWeight",
strata = c("region", "hsize"), period = "year")
# create spit households
eusilc[, pidsplit := pid]
#> hid hsize region pid age gender ecoStat citizenship
#> <int> <fctr> <fctr> <int> <fctr> <fctr> <fctr> <fctr>
#> 1: 1 3 Tyrol 101 (25,45] female part time AT
#> 2: 1 3 Tyrol 102 (25,45] male full time Other
#> 3: 1 3 Tyrol 103 (-Inf,16] male <NA> <NA>
#> 4: 2 4 Tyrol 201 (25,45] female domestic AT
#> 5: 2 4 Tyrol 202 (25,45] male full time AT
#> ---
#> 44477: 8999 4 Lower Austria 899904 (-Inf,16] female education AT
#> 44478: 9000 1 Upper Austria 900001 (25,45] female full time AT
#> 44479: 5999 1 Tyrol 599901 (25,45] male full time AT
#> 44480: 7500 2 Tyrol 750001 (45,65] male full time AT
#> 44481: 7500 2 Tyrol 750002 (45,65] female disabled AT
#> py010n py050n py090n py100n py110n py120n py130n py140n hy040n
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 9756.25 0 0.00 0 0 0 0 0 4273.9
#> 2: 12471.60 0 0.00 0 0 0 0 0 4273.9
#> 3: NA NA NA NA NA NA NA NA 4273.9
#> 4: 12487.03 0 0.00 0 0 0 0 0 0.0
#> 5: 42821.23 0 0.00 0 0 0 0 0 0.0
#> ---
#> 44477: 0.00 0 0.00 0 0 0 0 0 0.0
#> 44478: 13962.56 0 0.00 0 0 0 0 0 0.0
#> 44479: 14685.18 0 0.00 0 0 0 0 0 0.0
#> 44480: 20606.82 0 0.00 0 0 0 0 0 0.0
#> 44481: 0.00 0 3825.63 0 0 0 0 0 0.0
#> hy050n hy070n hy080n hy090n hy110n hy130n hy145n eqSS eqIncome
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 2428.11 0 0 33.39 0 0 0 1.8 16090.69
#> 2: 2428.11 0 0 33.39 0 0 0 1.8 16090.69
#> 3: 2428.11 0 0 33.39 0 0 0 1.8 16090.69
#> 4: 1549.72 0 0 2.13 0 0 0 2.1 27076.24
#> 5: 1549.72 0 0 2.13 0 0 0 2.1 27076.24
#> ---
#> 44477: 1955.19 0 0 0.00 0 0 0 2.5 26970.02
#> 44478: 0.00 0 0 424.85 0 0 0 1.0 20242.09
#> 44479: 0.00 0 0 120.65 0 0 0 1.0 14805.83
#> 44480: 0.00 0 0 0.00 0 0 0 1.5 24680.88
#> 44481: 0.00 0 0 0.00 0 0 0 1.5 24680.88
#> db090 pWeight year povertyRisk pidsplit
#> <num> <num> <num> <lgcl> <int>
#> 1: 504.5696 504.5696 2010 FALSE 101
#> 2: 504.5696 504.5696 2010 FALSE 102
#> 3: 504.5696 504.5696 2010 FALSE 103
#> 4: 493.3824 493.3824 2010 FALSE 201
#> 5: 493.3824 493.3824 2010 FALSE 202
#> ---
#> 44477: 556.4260 556.4260 2012 FALSE 899904
#> 44478: 643.2557 643.2557 2012 FALSE 900001
#> 44479: 679.7288 679.7288 2012 FALSE 599901
#> 44480: 567.1544 567.1544 2012 FALSE 750001
#> 44481: 567.1544 567.1544 2012 FALSE 750002
year <- eusilc[, unique(year)]
year <- year[-1]
leaf_out <- c()
for(y in year) {
split.person <- eusilc[
year == (y-1) & !duplicated(hid) & !(hid %in% leaf_out),
sample(pid, 20)
]
overwrite.person <- eusilc[
(year == (y)) & !duplicated(hid) & !(hid %in% leaf_out),
.(pid = sample(pid, 20))
]
overwrite.person[, c("pidsplit", "year_curr") := .(split.person, y)]
eusilc[overwrite.person, pidsplit := i.pidsplit,
on = .(pid, year >= year_curr)]
leaf_out <- c(leaf_out,
eusilc[pid %in% c(overwrite.person$pid,
overwrite.person$pidsplit),
unique(hid)])
}
dat_boot <- draw.bootstrap(
eusilc, REP = 1, hid = "hid", weights = "pWeight",
strata = c("region", "hsize"), period = "year", split = TRUE,
pid = "pidsplit")
# split households were considered e.g. household and
# split household were both selected or not selected
dat_boot[, data.table::uniqueN(w1), by = pidsplit][V1 > 1]
#> Empty data.table (0 rows and 2 cols): pidsplit,V1