Improving record linkage performance in the presence of missing linkage data

J Biomed Inform. 2014 Dec:52:43-54. doi: 10.1016/j.jbi.2014.01.016. Epub 2014 Feb 10.

Abstract

Introduction: Existing record linkage methods do not handle missing values in linking fields efficiently or effectively. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when linkage fields have missing values.

Methods: By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to solve the missing data problem in record linkage, which we refer to as Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of each missing attribute across the remaining available linkage fields in proportion to their relative weights. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value itself. Linkage Expansion adds fields previously treated as non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.
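The three strategies can be illustrated with a minimal Python sketch. It is not the FRIL implementation: the weighted-sum score, the exact-match distance, the default imputed distance of 0.5, the `backups` mapping, and all function names are assumptions introduced here for illustration only.

```python
from typing import Dict, Optional, Set

Record = Dict[str, Optional[str]]


def exact_distance(a: str, b: str) -> float:
    """Toy distance (assumption): 0.0 for identical values, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def missing_fields(a: Record, b: Record, fields) -> Set[str]:
    """Fields whose value is missing in at least one of the two records."""
    return {f for f in fields if a.get(f) is None or b.get(f) is None}


def redistribute_weights(weights: Dict[str, float], missing: Set[str]) -> Dict[str, float]:
    """Weight Redistribution: drop missing fields and rescale the remaining
    weights proportionally so that the total weight is preserved."""
    available = {f: w for f, w in weights.items() if f not in missing}
    total_available = sum(available.values())
    if total_available == 0:
        return {}
    total = sum(weights.values())
    return {f: w * total / total_available for f, w in available.items()}


def score_weight_redistribution(a: Record, b: Record, weights: Dict[str, float]) -> float:
    """Score a record pair using only the fields present in both records."""
    adjusted = redistribute_weights(weights, missing_fields(a, b, weights))
    return sum(w * (1.0 - exact_distance(a[f], b[f])) for f, w in adjusted.items())


def score_distance_imputation(a: Record, b: Record, weights: Dict[str, float],
                              imputed_distance: float = 0.5) -> float:
    """Distance Imputation: substitute a distance for missing fields instead of
    imputing a value. The constant 0.5 is a placeholder assumption."""
    missing = missing_fields(a, b, weights)
    score = 0.0
    for f, w in weights.items():
        d = imputed_distance if f in missing else exact_distance(a[f], b[f])
        score += w * (1.0 - d)
    return score


def expand_linkage_fields(weights: Dict[str, float], missing: Set[str],
                          backups: Dict[str, str]) -> Dict[str, float]:
    """Linkage Expansion: replace each missing linkage field with a
    hypothetical backup field that inherits its weight."""
    expanded = {f: w for f, w in weights.items() if f not in missing}
    for f in missing:
        if f in backups:
            expanded[backups[f]] = weights[f]
    return expanded
```

For example, with weights of 0.5, 0.3, and 0.2 on first name, last name, and date of birth, a pair that agrees on both names but lacks a date of birth in one record scores 1.0 under the redistribution sketch and 0.9 under the imputation sketch with the placeholder distance.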

Results: The methods developed had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1 in data sets with low corruption rates. Increased corruption rates led to decreased sensitivity for all methods.
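For reference, sensitivity and PPV follow their standard definitions: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), where a true positive is a linked pair that is also a true match. The short Python sketch below computes both from sets of record pairs; the function name and the pair representation are assumptions for illustration, not part of the study's tooling.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def sensitivity_and_ppv(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Return (sensitivity, PPV) for predicted links against known true matches."""
    tp = len(predicted & truth)   # linked pairs that are true matches
    fn = len(truth - predicted)   # true matches that were not linked
    fp = len(predicted - truth)   # linked pairs that are not true matches
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv
```

In a simulated data set the true match status of every pair is known by construction, which is what makes both measures directly computable.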

Conclusions: These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.

Keywords: Comparative effectiveness research; Data quality; Missing data; Quasi-identifiers; Record linkage.

Publication types

  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Biomedical Research / methods*
  • Biomedical Research / standards*
  • Humans
  • Medical Informatics*
  • Medical Record Linkage / methods*
  • Medical Record Linkage / standards*
  • Research Design