As example data, a subset of sleep is used. The columns have been selected deliberately to include some interactions between the missing values.
str(dataset)
#> 'data.frame':    62 obs. of  4 variables:
#>  $ Dream  : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#>  $ NonD   : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#>  $ BodyWgt: num  8.803 0 1.2194 -0.0834 7.8427 ...
#>  $ Span   : num  3.65 1.5 2.64 NA 4.23 ...
To invoke the imputation methods, a formula specifies which variables are to be estimated (left-hand side) and which variables should be used as regressors (right-hand side). We will start by imputing NonD based on BodyWgt and Span.
imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset)
#> There still missing values in variable NonD . Probably due to missing values in the regressors.
imp_ranger <- rangerImpute(NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
We can see that for regressionImp() there are still missings in NonD for all observations where Span is unobserved. This is because the regression model could not be applied to those observations. The same is true for the values imputed via rangerImpute().
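The reason for the remaining missings can be illustrated with a small base-R sketch. It does not use VIM's internal code; it only mimics the idea of regression imputation with lm() and predict() on simulated data, and the names toy, fit and to_fill are chosen for illustration.

```r
# Base-R sketch (not VIM's internal code): regression imputation with lm().
# The data are simulated purely for illustration.
set.seed(42)
n <- 20
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)

toy$y[c(3, 7, 11)] <- NA   # values we want to impute
toy$x2[c(7, 15)]   <- NA   # row 7 lacks both y and a regressor

# lm() drops incomplete rows when fitting (default na.action is na.omit)
fit <- lm(y ~ x1 + x2, data = toy)

# predict() returns NA wherever a regressor is missing
to_fill <- is.na(toy$y)
toy$y[to_fill] <- predict(fit, newdata = toy[to_fill, ])

is.na(toy$y[c(3, 7, 11)])
#> [1] FALSE  TRUE FALSE
```

Rows 3 and 11 receive a prediction, while row 7 stays missing because one of its regressors is unobserved. This is the same pattern as in the aggregation plot for NonD.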
As we can see in the next two plots, the correlation structure of NonD and BodyWgt is preserved by both imputation methods. In the case of regressionImp(), the imputed values almost follow a straight line, which suggests that the variable Span had little to no effect on the model. For rangerImpute(), on the other hand, Span played an important role in the generation of the imputed values.
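The straight-line argument can be made concrete with simulated data. In the sketch below (hypothetical data, not the sleep subset), y depends strongly on x1 and only negligibly on x2. We then check how much of the variation in the linear-model predictions is explained by x1 alone.

```r
# Simulated illustration (hypothetical data, not the sleep subset):
# y depends strongly on x1 and only negligibly on x2.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 3 - 1.5 * sim$x1 + 0.01 * sim$x2 + rnorm(n, sd = 0.5)

fit  <- lm(y ~ x1 + x2, data = sim)
pred <- predict(fit)

# Squared correlation between the predictions and x1.
# A value close to 1 means the predictions lie almost on a straight line
# when plotted against x1, i.e. x2 contributes very little.
r2_line <- cor(pred, sim$x1)^2

round(coef(fit), 2)
r2_line
```

A value of r2_line near 1 corresponds to the situation seen for regressionImp(). Predictions that scatter visibly around the line would instead indicate that the second regressor matters.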
imp_regression <- regressionImp(Dream + NonD ~ BodyWgt + Span, dataset)
#> There still missing values in variable Dream . Probably due to missing values in the regressors.
#> There still missing values in variable NonD . Probably due to missing values in the regressors.
imp_ranger <- rangerImpute(Dream + NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
Again, there are missings left for both Dream and NonD.
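One way to read a formula with several variables on the left-hand side is as one model per target variable, each using the same regressors. The sketch below works under that assumption. split_formula() is a hypothetical helper written for illustration and is not part of VIM.

```r
# Sketch under the assumption that each left-hand-side variable is imputed
# with its own model using the same regressors. split_formula() is a
# hypothetical helper, not a VIM function.
split_formula <- function(f) {
  targets    <- all.vars(f[[2]])   # variables left of ~
  regressors <- all.vars(f[[3]])   # variables right of ~ (plain names only)
  lapply(targets, function(target) reformulate(regressors, response = target))
}

forms <- split_formula(Dream + NonD ~ BodyWgt + Span)
vapply(forms, function(f) deparse(f), character(1))
```

The call returns the two formulas Dream ~ BodyWgt + Span and NonD ~ BodyWgt + Span. Note that all.vars() keeps only variable names, so this helper would lose transformations or interaction terms on the right-hand side.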
In order to validate the performance of regressionImp(), the iris dataset is used. First, some values are randomly set to NA.
library(reactable)
data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA
aggr(df)
We can see that there are missings in all numeric variables, and some observations have missing values in more than one variable. In the next step, we perform a multiple-variable imputation in which Species serves as the regressor.
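With a single categorical regressor such as Species, a linear model simply predicts the group mean of the observed values. The following base-R sketch illustrates this idea and, since the true iris values are known, measures how far the imputed values are from them. It is not the code used by regressionImp(); the names truth, masked, imputed, idx and rmse are chosen for illustration only.

```r
# Base-R sketch (an assumption about the mechanics, not VIM's code):
# impute each numeric variable from Species with lm() and compare the
# imputed values with the true iris values that were masked.
data(iris)
truth  <- iris
masked <- iris

# Mask up to 50 distinct cells in the four numeric columns
set.seed(1)
idx <- unique(cbind(row = sample(nrow(iris), 50, replace = TRUE),
                    col = sample(ncol(iris) - 1, 50, replace = TRUE)))
masked[idx] <- NA

imputed <- masked
for (v in names(iris)[1:4]) {
  miss <- is.na(masked[[v]])
  if (!any(miss)) next
  # With Species as the only regressor, predictions are per-species means
  fit <- lm(reformulate("Species", response = v), data = masked)
  imputed[[v]][miss] <- predict(fit, newdata = masked[miss, , drop = FALSE])
}

anyNA(imputed)

# Root mean squared error of the imputed cells against the original values
rmse <- sqrt(mean((as.matrix(imputed[1:4])[idx] - as.matrix(truth[1:4])[idx])^2))
rmse
```

Because Species itself is never masked, every missing cell can be predicted and no missings remain. The RMSE gives a rough sense of how well species means recover the masked measurements.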
imp_regression <- regressionImp(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_regression, delimiter = "imp")
The plot indicates that all missing values have been imputed by the regressionImp() algorithm. The following table displays the rounded first five results of the imputation for all variables.