Extends the cellwise minimum covariance determinant (MCD) approach (Raymaekers & Rousseeuw 2024) to mixed continuous and categorical data. The method estimates a robust covariance matrix for the continuous block via MCD, computes cellwise weights from conditional residuals, and imputes missing values via conditional expectations (continuous variables) and weighted multinomial regression (categorical variables). These steps are repeated until the convergence tolerance eps is met or maxit iterations have been run.

imputeCellMCD(
  data,
  maxit = 50,
  eps = 0.005,
  method = "tukey",
  alpha = NULL,
  mcd_alpha = 0.75,
  hard_threshold = 0.5,
  mcd_observed = "all",
  init_method = "median",
  uncert = "conditional",
  m = 1L,
  boot = FALSE,
  trace = FALSE
)

Arguments

data

a data.frame with missing values (mixed continuous and categorical variables are supported).

maxit

maximum number of iterations (default: 50).

eps

convergence tolerance (default: 5e-3).

method

weight function for cell weights: "tukey" (default) or "huber".

alpha

tuning constant of the weight function selected by method. NULL (default) uses 4.685 for Tukey and 1.345 for Huber.
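To illustrate how the tuning constant shapes the cell weights, here is a minimal sketch of the textbook Tukey biweight and Huber weight functions with the default constants above. The helper names tukey_weight and huber_weight are hypothetical, and the sketch assumes the weights are applied to standardized residuals; the package's internal implementation may differ.

```r
# Textbook weight functions (assumed forms, not taken from the package source).
# r: standardized residuals, k: tuning constant.
tukey_weight <- function(r, k = 4.685) {
  u <- r / k
  # Smoothly decreasing weight that reaches exactly 0 once |r| exceeds k
  ifelse(abs(u) <= 1, (1 - u^2)^2, 0)
}

huber_weight <- function(r, k = 1.345) {
  # Weight 1 for |r| <= k, then decaying as k / |r| (never exactly 0)
  pmin(1, k / abs(r))
}

r <- c(0, 1, 2, 5)
tukey_weight(r)
huber_weight(r)
```

Under these assumed forms, Tukey weights can fully reject a cell, whereas Huber weights only downweight it.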

mcd_alpha

MCD concentration parameter (default: 0.75).

hard_threshold

numeric in \([0, 1]\) (default: 0.5). Before iteration, observed cells whose initial MCD weight falls below this threshold are set to missing and re-imputed (detect-once preprocessing). Set to NULL or 0 to disable.
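The detect-once preprocessing described for hard_threshold can be sketched as follows. The helper apply_hard_threshold and the toy weight matrix are hypothetical and only illustrate the rule as documented; they are not the package's internal code.

```r
# Hypothetical helper: set observed cells with a low initial weight to NA
# so that they are treated like missing values and re-imputed.
apply_hard_threshold <- function(x, w, hard_threshold = 0.5) {
  # NULL or 0 disables the preprocessing step
  if (is.null(hard_threshold) || hard_threshold == 0) return(x)
  flagged <- !is.na(x) & w < hard_threshold
  x[flagged] <- NA
  x
}

x <- matrix(c(1.2, 0.8, 35, 1.1, 2.0, 2.2), nrow = 3)   # toy data, cell [3, 1] is outlying
w <- matrix(c(1, 0.9, 0.05, 1, 0.95, 0.7), nrow = 3)    # toy initial cell weights
apply_hard_threshold(x, w)
```

In this toy example only the cell with weight 0.05 falls below the threshold and is set to NA.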

mcd_observed

strategy for covariance estimation: "all" (default) runs MCD on all data including imputed values; "weighted" uses cellWise::cwLocScat with imputed cells receiving weight 0 (requires the cellWise package); "pairwise" uses pairwise robust (Gnanadesikan–Kettenring) covariances on observed cells only.
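For the "pairwise" strategy, the Gnanadesikan–Kettenring idea is to obtain a covariance from robust scales of the sum and the difference of two variables. The sketch below uses stats::mad as the robust scale and restricts each pair to jointly observed rows; both the helper name gk_cov and the choice of mad are assumptions for illustration, since the scale estimator used by the package is not stated here.

```r
# Gnanadesikan-Kettenring pairwise covariance:
#   cov(x, y) = (s(x + y)^2 - s(x - y)^2) / 4, with s a robust scale.
# scale_fun = stats::mad is an assumption made for this sketch.
gk_cov <- function(x, y, scale_fun = stats::mad) {
  ok <- !is.na(x) & !is.na(y)            # use jointly observed cells only
  (scale_fun(x[ok] + y[ok])^2 - scale_fun(x[ok] - y[ok])^2) / 4
}

set.seed(1)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)
y[1:5] <- 50                             # a few outlying cells
x[6:10] <- NA                            # a few missing cells

gk_cov(x, y)                             # robust pairwise estimate
cov(x, y, use = "complete.obs")          # classical estimate, for comparison
```

A matrix assembled from such pairwise estimates is not guaranteed to be positive semidefinite, so a correction step is typically needed before it is used for conditioning.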

init_method

initialisation for missing values before iteration: "median" (default), "knn", or "irmi".

uncert

imputation uncertainty: "conditional" (default) adds noise from the conditional normal distribution.
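As an illustration of the "conditional" option, the following sketch computes the conditional normal distribution of the missing cells in one row given its observed cells, and draws one imputation from it. It assumes a multivariate normal model with known mu and Sigma; the helper cond_normal is hypothetical and is not the package's internal code.

```r
# Conditional distribution of the missing part of x given the observed part,
# under a multivariate normal model with mean mu and covariance Sigma.
cond_normal <- function(x, mu, Sigma) {
  mis <- is.na(x)
  obs <- !mis
  # Regression coefficients of the missing on the observed coordinates
  B <- Sigma[mis, obs, drop = FALSE] %*% solve(Sigma[obs, obs, drop = FALSE])
  list(
    mean = drop(mu[mis] + B %*% (x[obs] - mu[obs])),
    cov  = Sigma[mis, mis, drop = FALSE] - B %*% Sigma[obs, mis, drop = FALSE]
  )
}

mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
x     <- c(2, NA)

cn <- cond_normal(x, mu, Sigma)
cn$mean   # conditional expectation: 1
cn$cov    # conditional variance: 0.75

# One imputation draw: conditional mean plus noise with the conditional covariance
set.seed(1)
draw <- cn$mean + drop(t(chol(cn$cov)) %*% rnorm(length(cn$mean)))
```

Imputing with cn$mean alone gives the conditional expectation; adding the noise term is what distinguishes repeated draws when several imputations are generated.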

m

number of multiple imputations (default: 1). If m > 1, a list of imputed datasets is returned.

boot

logical; if TRUE, bootstrap resampling propagates parameter uncertainty across the m imputations (default: FALSE).

trace

logical; if TRUE, progress information is printed (default: FALSE).

Value

A list with components:

data_imputed

the imputed data.frame.

cellweights

\(n \times p\) matrix of cell weights. Continuous observed cells have values in \([0, 1]\); categorical columns always have weight 1 (cellwise detection is only applied to continuous variables).

mu

robust location estimate (continuous variables).

Sigma

robust covariance estimate (continuous variables).

converged

logical indicating convergence.

iterations

number of iterations performed.

References

Raymaekers, C. and Rousseeuw, P.J. (2024). The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(545), 576–588.

Author

Matthias Templ

Examples

if (FALSE) { # \dontrun{
data(sleep, package = "VIM")
result <- imputeCellMCD(sleep)
head(result$data_imputed)

# Inspect cell weights
image(result$cellweights, main = "Cell weights")

# With pairwise robust covariance
result2 <- imputeCellMCD(sleep, mcd_observed = "pairwise", trace = TRUE)
} # }