R/vimpute.R
vimpute.Rd
Impute missing values with a preferred model, sequentially, with hyperparameter tuning and with PMM (if wanted). Requires the 'helper_vimpute' script.
vimpute(
  data,
  considered_variables = names(data),
  method = setNames(as.list(rep("ranger", length(considered_variables))),
    considered_variables),
  pmm = FALSE,
  pmm_k = NULL,
  pmm_k_method = "mean",
  learner_params = NULL,
  formula = FALSE,
  sequential = TRUE,
  nseq = 10,
  eps = 0.005,
  imp_var = TRUE,
  pred_history = FALSE,
  tune = FALSE,
  verbose = FALSE
)
data: Dataset with missing values. Provide as a data.table.
considered_variables: A character vector of variable names to be imputed or used as predictors. All other columns are excluded from the imputation process.
method: Specifies the imputation method for each variable. Can be provided either:
as a single global method (e.g. "ranger"), applied to all variables, or
as a named list (e.g. list(var1 = "xgboost", var2 = "robust")), assigning a method to each variable individually.
Supported methods include "ranger", "xgboost", "regularized" (glmnet) and "robust" (lmrob/glmrob).
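The two accepted forms can be related by a small sketch. The helper `expand_method()` below is hypothetical and not part of the package; it only mirrors the default shown in the usage, which builds one named list entry per considered variable.

```r
# Hypothetical helper: expand a single global method into a named list
# with one entry per variable, mirroring the default in the usage section.
expand_method <- function(method, considered_variables) {
  if (is.character(method) && length(method) == 1L) {
    method <- setNames(
      as.list(rep(method, length(considered_variables))),
      considered_variables
    )
  }
  method
}

vars <- c("Sleep", "Dream", "Span")

# Global form: the same method for every variable
global_method <- expand_method("ranger", vars)

# Per-variable form: a named list built with list(), passed through unchanged
per_variable <- expand_method(
  list(Sleep = "xgboost", Dream = "robust", Span = "ranger"),
  vars
)

str(global_method)
```

Note that the per-variable form is built with `list()`, so that each name maps to one method string.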
pmm: Predictive Mean Matching (PMM) settings. Can be provided:
as a single TRUE/FALSE (global), or
as a named list, assigning PMM per (numeric) variable.
pmm_k: Number of nearest neighbors used in PMM. Accepted forms:
single global integer (applies to all variables), or
named list assigning values per variable, or
NULL (default), meaning:
k = 1 automatically for variables using PMM,
k = NULL for variables without PMM
pmm_k_method: Aggregation method used when pmm_k > 1 in PMM. Default is "mean".
Accepted forms:
single global string ("mean", "median", "random"), or
single global function (called with the k nearest observed values), or
named list assigning methods per variable, or
NULL values inside such lists, which fall back to "mean"
Semantics:
"mean": mean of the k nearest neighbors
"median": median of the k nearest neighbors
"random": random draw of one among the k nearest neighbors
function: custom aggregator returning one numeric value
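To make the roles of pmm_k and pmm_k_method concrete, here is a minimal base R sketch of the aggregation step for a single missing observation. The function `pmm_aggregate()` and the absolute-difference donor search are assumptions made for illustration; the package's internal implementation may differ.

```r
# Sketch of PMM aggregation for one missing observation (assumed mechanics).
pmm_aggregate <- function(pred_missing, pred_observed, y_observed,
                          k = 1, agg = "mean") {
  # NULL falls back to "mean", as described for list entries above
  if (is.null(agg)) agg <- "mean"
  # Distance between the prediction for the missing case and the
  # predictions for all observed cases
  d <- abs(pred_observed - pred_missing)
  # Observed values of the k donors with the closest predictions
  donors <- y_observed[order(d)[seq_len(k)]]
  # Custom aggregator: called with the k nearest observed values
  if (is.function(agg)) return(agg(donors))
  switch(agg,
    mean   = mean(donors),
    median = median(donors),
    random = donors[sample.int(length(donors), 1)],
    stop("unknown aggregation method: ", agg)
  )
}

pred_observed <- c(1.0, 2.0, 3.0, 4.0, 5.0)  # model predictions, observed rows
y_observed    <- c(1.2, 1.9, 3.4, 3.8, 5.5)  # actual observed values

k1      <- pmm_aggregate(2.9, pred_observed, y_observed, k = 1)
k3_med  <- pmm_aggregate(2.9, pred_observed, y_observed, k = 3, agg = "median")
k3_max  <- pmm_aggregate(2.9, pred_observed, y_observed, k = 3, agg = max)
k3_rand <- pmm_aggregate(2.9, pred_observed, y_observed, k = 3, agg = "random")
```

With k = 1 the imputed value is always an actually observed value. With k > 1, "mean" and "median" can produce values that were never observed, while "random" keeps the imputation within the observed values.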
learner_params: Hyperparameters for the chosen methods. Can be provided in three ways:
Per variable (e.g. list(mpg = list(num.trees = 500)))
Per method (e.g. list(ranger = list(num.trees = 600)))
Global, applied to all variables using the same method
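The list shapes above can be sketched as follows. The lookup helper `resolve_params()` is hypothetical, and both the precedence it uses (variable entry first, then method entry) and the mixing of both shapes in one list are assumptions made for illustration, not documented behaviour.

```r
# Per variable: keyed by variable name
params_by_variable <- list(mpg = list(num.trees = 500))

# Per method: keyed by method name
params_by_method <- list(ranger = list(num.trees = 600))

# Hypothetical lookup: prefer a variable-specific entry, otherwise fall
# back to the entry for the method. This precedence is an assumption.
resolve_params <- function(learner_params, variable, method) {
  if (!is.null(learner_params[[variable]])) return(learner_params[[variable]])
  if (!is.null(learner_params[[method]])) return(learner_params[[method]])
  list()
}

# Combined only for this illustration
learner_params <- c(params_by_variable, params_by_method)

trees_mpg <- resolve_params(learner_params, "mpg", "ranger")$num.trees
trees_hp  <- resolve_params(learner_params, "hp", "ranger")$num.trees
```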
formula: Optional modeling formula to restrict or transform predictor variables. Only supported for the regularized (glmnet) and robust (lmrob/glmrob) methods. Provide as a named list, e.g.:
list(mpg = mpg ~ hp + drat)
list(hp = log(hp) ~ wt + cyl)
For X: follows the rules of model.matrix.
For Y: supported transformations are log(), exp(), sqrt() and I(1/..). Only applicable to numeric variables.
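As a rough illustration of the two sides of such a formula, the base R sketch below uses the built-in mtcars data and `lm()` as a stand-in for the regularized and robust learners (an assumption; it is not what the package fits). It shows how model.matrix expands the predictors and how a log-transformed response is brought back to the original scale by hand.

```r
# X side: model.matrix rules decide how predictors are expanded,
# e.g. a factor becomes a set of dummy columns
X <- model.matrix(~ wt + factor(cyl), data = mtcars)
x_cols <- colnames(X)

# Y side: fit on the transformed scale ...
fit <- lm(log(hp) ~ wt + cyl, data = mtcars)
pred_log <- predict(fit, newdata = data.frame(wt = 3, cyl = 6))

# ... and back-transform the prediction to the scale of hp
pred_hp <- exp(pred_log)
```

The inverse of each supported transformation is straightforward (exp() for log(), squaring for sqrt(), the reciprocal for I(1/..)), which is presumably why only these are accepted on the response side.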
sequential: If TRUE, all variables with missing data are imputed sequentially across iterations.
nseq: Maximum number of iterations (if sequential is TRUE).
eps: Convergence threshold: the imputation process stops early if predictions change by less than this amount between iterations.
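The interplay of nseq and eps can be sketched with a toy loop. The update rule and the change measure (mean absolute difference between successive predictions) are assumptions chosen for illustration; the page does not state how the change is measured.

```r
# Toy sequential loop: stop when successive predictions change less than eps
nseq <- 10
eps  <- 0.005

pred   <- 10  # starting prediction for a missing cell
target <- 4   # value the toy update converges towards
iterations <- 0

for (i in seq_len(nseq)) {
  # Toy stand-in for refitting the model and predicting again
  new_pred <- pred + 0.9 * (target - pred)
  # Assumed change measure: mean absolute difference between iterations
  change <- mean(abs(new_pred - pred))
  pred <- new_pred
  iterations <- i
  if (change < eps) break  # converged before reaching nseq
}
```

In this toy run the loop stops early, so nseq acts only as an upper bound on the number of iterations.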
imp_var: If TRUE, additional columns indicating imputed values (VAR_imp) are added.
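A minimal sketch of what such indicator columns look like, assuming they are logical flags named after the variable with an `_imp` suffix. The mean fill is only a placeholder for the actual imputation model.

```r
df <- data.frame(Sleep = c(3.3, NA, 12.5), Dream = c(NA, 2.0, 1.8))

for (v in c("Sleep", "Dream")) {
  missing <- is.na(df[[v]])
  # Indicator column: TRUE where the value was imputed
  df[[paste0(v, "_imp")]] <- missing
  # Placeholder imputation (column mean), standing in for the model
  df[[v]][missing] <- mean(df[[v]], na.rm = TRUE)
}

names(df)
```

Keeping the flags alongside the data makes it possible to separate observed from imputed values in later analyses.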
pred_history: If TRUE, all predicted values across all iterations are stored.
tune: Hyperparameter tuning flag. Can be:
TRUE/FALSE globally, or
a list specifying tuning per variable, e.g. list(var1 = TRUE).
Tuning is performed halfway through the nseq iterations.
verbose: If TRUE, additional debugging output is provided.
Return value, either:
the imputed dataset (default), or
a list containing the imputed dataset and prediction history, depending on the pred_history and tune settings.
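Because the return type depends on the settings, calling code may want to handle both shapes. The helper `extract_imputed()` below is hypothetical. The element names of the returned list are not given here, so the sketch looks for the first data.frame element instead of assuming a name; that choice is itself an assumption.

```r
# Hypothetical helper: return the imputed table whether the result is
# the table itself or a list that contains it.
extract_imputed <- function(res) {
  # A data.table is also a data.frame, so this covers the default case
  if (is.data.frame(res)) return(res)
  is_df <- vapply(res, is.data.frame, logical(1))
  if (!any(is_df)) stop("no data.frame found in result")
  # Assumption: the first data.frame element is the imputed dataset
  res[[which(is_df)[1]]]
}

plain   <- data.frame(x = 1:3)
wrapped <- list(plain, list(iter1 = c(1, 2)))

from_plain   <- extract_imputed(plain)
from_wrapped <- extract_imputed(wrapped)
```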
Other imputation methods:
hotdeck(),
impPCA(),
imputeRobust(),
imputeRobustChain(),
irmi(),
kNN(),
matchImpute(),
medianSamp(),
rangerImpute(),
regressionImp(),
sampleCat(),
xgboostImpute()
if (FALSE) { # \dontrun{
x <- vimpute(data = sleep, sequential = FALSE)
y <- vimpute(data = sleep, sequential = TRUE, nseq = 3)
z <- vimpute(data = sleep,
  considered_variables = c("Sleep", "Dream", "Span", "BodyWgt"),
  sequential = FALSE)
} # }