J Big Data. 2020;7(1):37.
doi: 10.1186/s40537-020-00313-w. Epub 2020 Jun 12.

SICE: an improved missing data imputation technique

Free PMC article

Shahidul Islam Khan et al. J Big Data. 2020.
Abstract

In data analytics, missing data is a factor that degrades performance. Incorrect imputation of missing values can lead to wrong predictions. In the era of big data, when massive volumes of data are generated every second and making use of these data is a major concern for stakeholders, handling missing values efficiently becomes more important. In this paper, we propose a new technique for missing data imputation that is a hybrid of single and multiple imputation techniques. We propose an extension of the popular Multivariate Imputation by Chained Equations (MICE) algorithm, in two variations, to impute categorical and numeric data. We also implemented twelve existing algorithms to impute binary, ordinal, and numeric missing values. We collected sixty-five thousand real health records from hospitals and diagnostic centers in Bangladesh while maintaining the privacy of the data. We also collected three public datasets from the UCI Machine Learning Repository, ETH Zurich, and Kaggle. We compared the performance of our proposed algorithms with the existing algorithms on these datasets. Experimental results show that our proposed algorithm achieves a 20% higher F-measure for binary data imputation and 11% lower error for numeric data imputation than its competitors, with similar execution time.

Keywords: Data Analytics; MICE; Missing Data Imputation; Multiple Imputation; Single Imputation.
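
The chained-equation idea that MICE builds on can be illustrated with a minimal sketch. This is not the authors' SICE algorithm and not a full MICE implementation: it assumes only two numeric columns, uses plain least-squares regression, and performs deterministic updates with no random draws and no multiple imputed datasets. The function names `fit_line` and `chained_impute` are hypothetical and chosen for illustration only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Predictor is constant: fall back to predicting the mean of y.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def chained_impute(rows, n_iter=10):
    """Fill None entries in a two-column numeric table (illustrative only).

    rows: list of [x, y] pairs, with None marking a missing value.
    Each column is assumed to have at least one observed value.
    Returns a completed copy; observed values are never modified.
    """
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]

    # Step 1: initialise every missing cell with its column mean.
    for j in range(2):
        col_mean = mean(r[j] for r in rows if r[j] is not None)
        for i, r in enumerate(data):
            if missing[i][j]:
                r[j] = col_mean

    # Step 2: cycle through the columns, regressing each one on the other
    # and overwriting only the cells that were originally missing.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # index of the predictor column
            train = [(r[k], r[j]) for i, r in enumerate(data)
                     if not missing[i][j]]
            a, b = fit_line([t[0] for t in train], [t[1] for t in train])
            for i, r in enumerate(data):
                if missing[i][j]:
                    r[j] = a * r[k] + b
    return data


if __name__ == "__main__":
    # Made-up toy table in which the second column equals 2 * first + 1.
    table = [[1, 3], [2, 5], [3, 7], [4, None]]
    print(chained_impute(table))
```

The toy table is invented for the sketch and is unrelated to the datasets described in the abstract. A real multiple-imputation procedure would add randomness to each update and produce several completed datasets whose results are then combined.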

Conflict of interest statement

Competing interests: The authors do not have any competing interests.

Figures

Fig. 1. Regression lines from two sets of random 100 data taken from 1000 library fine data
Fig. 2. MICE flowchart
Fig. 3. Flowchart of SICE
Fig. 4. Block diagram of the system
Fig. 5. Accuracy and F-measure for four algorithms to impute gender attribute
Fig. 6. Performance comparison of MICE and SICE for additional binary datasets
Fig. 7. Performance of MICE and SICE for ordinal data using PMM and POLYREG
Fig. 8. Comparison of execution time of MICE and SICE to impute UCI car dataset
Fig. 9. Performance of algorithms to predict house prices
