Impute missing values with a preferred model, sequentially, with hyperparameter tuning and with PMM (if wanted). Requires the 'helper_vimpute' script.

vimpute(
  data,
  considered_variables = names(data),
  method = setNames(as.list(rep("ranger", length(considered_variables))),
    considered_variables),
  pmm = FALSE,
  pmm_k = NULL,
  pmm_k_method = "mean",
  learner_params = NULL,
  formula = FALSE,
  sequential = TRUE,
  nseq = 10,
  eps = 0.005,
  imp_var = TRUE,
  pred_history = FALSE,
  tune = FALSE,
  verbose = FALSE
)

Arguments

data

Dataset with missing values. Provide as a data.table.

considered_variables

A character vector of variable names to be either imputed or used as predictors, excluding irrelevant columns from the imputation process.

method

Specifies the imputation method for each variable. Can be provided either:

  • as a single global method (e.g. "ranger"), applied to all variables, or

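  • as a named list (e.g. list(var1 = "xgboost", var2 = "robust")), assigning a method to each variable individually.

Supported methods named on this page: "ranger" (the default), "xgboost", "regularized" (glmnet) and "robust" (lmrob/glmrob).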

pmm

Predictive Mean Matching (PMM) settings. Can be provided:

  • as a single TRUE/FALSE (global), or

  • as a named list, assigning PMM per (numeric) variable.

pmm_k

Number of nearest neighbors used in PMM. Accepted forms:

  • single global integer (applies to all variables), or

  • named list assigning values per variable, or

  • NULL (default), meaning:

    • k = 1 automatically for variables using PMM,

    • k = NULL for variables without PMM

pmm_k_method

Aggregation method used when pmm_k > 1 in PMM. Default is "mean". Accepted forms:

  • single global string ("mean", "median", "random"), or

  • single global function (called with the k nearest observed values), or

  • named list assigning methods per variable, or

  • NULL values inside such lists, which fall back to "mean".

Semantics:

  • "mean": mean of the k nearest neighbors

  • "median": median of the k nearest neighbors

  • "random": random draw of one among the k nearest neighbors

  • function: custom aggregator returning one numeric value
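The aggregation semantics above can be sketched in base R. This is a conceptual illustration, not VIM's internal code: the helper pmm_aggregate() and its argument names are hypothetical, and choosing donors by absolute distance between predictions is an assumption about how the matching step works.

```r
# Hypothetical helper (not part of VIM): aggregate the k observed values
# whose predictions lie closest to the prediction for one missing case.
pmm_aggregate <- function(pred_missing, pred_observed, y_observed,
                          k = 1, k_method = "mean") {
  # Rank donors by absolute distance between predictions (assumed rule)
  nearest <- order(abs(pred_observed - pred_missing))[seq_len(k)]
  donors <- y_observed[nearest]
  # A custom aggregator is called with the k nearest observed values
  if (is.function(k_method)) {
    return(k_method(donors))
  }
  switch(k_method,
    mean   = mean(donors),
    median = median(donors),
    random = donors[sample.int(length(donors), 1)],
    stop("unknown aggregation method: ", k_method)
  )
}

pred_observed <- c(1.0, 2.0, 3.0, 10.0)  # model predictions for observed rows
y_observed    <- c(1.2, 1.9, 3.5, 9.0)   # the corresponding observed values

pmm_aggregate(2.1, pred_observed, y_observed, k = 3, k_method = "mean")
pmm_aggregate(2.1, pred_observed, y_observed, k = 3, k_method = "median")
pmm_aggregate(2.1, pred_observed, y_observed, k = 3, k_method = max)
```

With k = 1 every aggregation method returns the single nearest donor, which is why pmm_k_method only matters when pmm_k > 1.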

learner_params

Hyperparameters for the chosen methods. Can be provided in three ways:

  • Per variable (e.g. list(mpg = list(num.trees = 500)))

  • Per method (e.g. list(ranger = list(num.trees = 600)))

  • Global, applied to all variables using the same method

formula

Optional modeling formula to restrict or transform predictor variables. Only supported for regularized (glmnet) and robust (lmrob/glmrob) methods. Provide as a named list, e.g.:

  • list(mpg = mpg ~ hp + drat)

  • list(hp = log(hp) ~ wt + cyl)

For X: follows the rules of model.matrix. For Y: supported transformations are log(), exp(), sqrt() and I(1/..). Only applicable to numeric variables.
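The formula rules for X and Y can be illustrated with base R alone. The sketch below uses the built-in mtcars data and a plain lm() as a stand-in for the regularized or robust learner; wrapping cyl in factor() is added here only to show how model.matrix expands a term, and the back-transformation with exp() is an assumption about what happens to predictions, not something this page states.

```r
# mtcars ships with base R (datasets package)
f <- log(hp) ~ wt + factor(cyl)

# X: model.matrix() expands the right-hand side, e.g. factor(cyl) into dummies
X <- model.matrix(f, data = mtcars)
colnames(X)

# Y: the left-hand side is evaluated on the transformed (log) scale
y_log <- model.response(model.frame(f, data = mtcars))

# A plain lm() stands in for the regularized/robust learner (assumption)
fit <- lm(f, data = mtcars)
pred_log <- predict(fit, newdata = mtcars[1:3, ])

# Predictions on the log scale are mapped back with the inverse transformation
pred_hp <- exp(pred_log)
```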

sequential

If TRUE, all variables with missing data are imputed sequentially across iterations.

nseq

Maximum number of iterations (if sequential is TRUE).

eps

Convergence threshold: the imputation process stops early if the predictions change by less than this amount between iterations.

imp_var

If TRUE, additional columns indicating imputed values (VAR_imp) are added.

pred_history

If TRUE, all predicted values across all iterations are stored.

tune

Hyperparameter tuning flag. Can be:

  • TRUE/FALSE globally

  • or a list specifying tuning per variable, e.g. list(var1 = TRUE)

Tuning is performed halfway through the nseq iterations.

verbose

If TRUE, additional debugging output is provided.

Value

Either:

  • the imputed dataset (default), or

  • a list containing the imputed dataset and prediction history, depending on the pred_history and tune settings.

Examples

if (FALSE) { # \dontrun{
x <- vimpute(data = sleep, sequential = FALSE)
y <- vimpute(data = sleep, sequential = TRUE, nseq = 3)
z <- vimpute(data = sleep, considered_variables =
       c("Sleep", "Dream", "Span", "BodyWgt"), sequential = FALSE)
} # }
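A further sketch shows how several of the arguments documented above might be combined in one call. It follows the not-run convention of the examples above, because it needs the VIM package and its learner dependencies; the particular values (pmm_k = 5, num.trees = 600, nseq = 5) are illustrative choices, not recommendations.

```r
# Variables taken from the example above; all four are numeric
vars <- c("Sleep", "Dream", "Span", "BodyWgt")

# Per-variable method list, built the same way as the default for `method`
method_list <- setNames(as.list(rep("ranger", length(vars))), vars)

if (FALSE) { # not run: needs the VIM package and its learner dependencies
library(VIM) # assumed to provide vimpute() and the sleep data set
w <- vimpute(
  data = sleep,
  considered_variables = vars,
  method = method_list,
  pmm = TRUE,                # PMM for all variables (global flag)
  pmm_k = 5,                 # five nearest neighbors per imputed value
  pmm_k_method = "median",   # aggregate the five donors by their median
  learner_params = list(ranger = list(num.trees = 600)),  # per-method form
  sequential = TRUE,
  nseq = 5
)
}
```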