A proficient cost reduction framework for de-duplication of records in data integration

Asif Sohail; Muhammad Murtaza Yousaf

doi:10.1186/s12911-016-0280-9

BMC Med Inform Decis Mak. 2016 Apr 12:16:42. doi: 10.1186/s12911-016-0280-9.

Authors

Asif Sohail¹, Muhammad Murtaza Yousaf²

Affiliations

¹ Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan. asif@pucit.edu.pk.
² Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan.

Abstract

Background: Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications that involve the integration of multiple data sources and data cleansing. The task is challenging because of its computational complexity and the variations in data representation across data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during de-duplication. Both require tuning a set of parameters, such as the choice of a particular blocking or windowing variant and the selection of an appropriate window size for each dataset.
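To make the two comparison-reduction strategies concrete, here is a minimal sketch of standard blocking and of sorted-neighbourhood windowing. It is an illustration only, not Febrl's API and not the framework proposed in the paper; the toy records, the helper names (`blocking_pairs`, `windowing_pairs`), and the choice of `postcode` and `surname` as keys are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

# Toy records (hypothetical data, for illustration only).
records = {
    1: {"surname": "smith", "postcode": "2600"},
    2: {"surname": "smyth", "postcode": "2600"},
    3: {"surname": "jones", "postcode": "2601"},
    4: {"surname": "jonas", "postcode": "2601"},
    5: {"surname": "smith", "postcode": "2611"},
}


def blocking_pairs(records, key):
    """Standard blocking: compare only records that share the same key value."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def windowing_pairs(records, key, window):
    """Sorted neighbourhood: sort by key, then compare each record only
    with the (window - 1) records that follow it in sorted order."""
    ordered = sorted(records, key=lambda rid: (key(records[rid]), rid))
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rid, other))))
    return pairs


def by_postcode(rec):
    return rec["postcode"]


def by_surname(rec):
    return rec["surname"]


block_candidates = blocking_pairs(records, by_postcode)
window_candidates = windowing_pairs(records, by_surname, window=2)
```

With five records there are 10 possible pairs. Blocking on `postcode` keeps 2 of them and windowing on `surname` with a window of 2 keeps 4. The two candidate sets also differ, which shows why the choice of key and window size matters: in this toy data, blocking on `postcode` never compares records 1 and 5, even though they share a surname.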

Methods: In this paper, we propose a framework that employs blocking and windowing techniques in succession, so that these parameters do not have to be determined. We also evaluate the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

Results: The proposed framework is comprehensively evaluated using a variety of quality and complexity measures, such as reduction ratio, precision, and recall. The proposed framework is observed to significantly reduce the number of record comparisons.
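For readers unfamiliar with these measures, the sketch below shows how they are commonly defined in the record linkage literature. The abstract does not spell out the exact formulas used in the paper, so the definitions here are assumptions: reduction ratio is taken as the fraction of all possible record pairs that the method avoids comparing, and precision and recall are computed over sets of record pairs. The function names and the example pairs are hypothetical.

```python
def reduction_ratio(n_candidate_pairs, n_records):
    """Fraction of the n*(n-1)/2 possible pairs that are NOT compared
    when de-duplicating a single dataset of n_records records."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1.0 - n_candidate_pairs / total_pairs


def precision_recall(predicted_pairs, true_pairs):
    """Precision and recall of a set of predicted duplicate pairs
    against a set of known true duplicate pairs."""
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical example: 5 records, 3 pairs flagged, 3 true duplicate pairs.
predicted = {(1, 2), (3, 4), (1, 3)}
truth = {(1, 2), (3, 4), (1, 5)}

rr = reduction_ratio(len(predicted), n_records=5)
precision, recall = precision_recall(predicted, truth)
```

In this example 3 of the 10 possible pairs are compared, so the reduction ratio is 0.7, and two of the three flagged pairs are true duplicates, so precision and recall are both 2/3. A higher reduction ratio means fewer comparisons, but it is only useful if recall stays high, which is why these measures are reported together.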

Conclusions: The selection of the linkage key is a critical performance factor for record linkage.

Keywords: Data integration; Inverted index; Record comparison reduction; Record linkage/de-duplication.

MeSH terms

Humans
Medical Informatics / methods*
Medical Record Linkage / methods*