Probabilistic linkage of large public health data files

Stat Med. 1995;14(5-7):491-8. doi: 10.1002/sim.4780140510.


Probabilistic linkage technology makes it feasible and efficient to link large public health databases in a statistically justifiable manner. The problem addressed by the methodology is that of matching two files of individual data under conditions of uncertainty. Each field is subject to error, which is measured by the probability that the field agrees given that a record pair matches (called the m probability) and by the probability of chance agreement of its value states (called the u probability). Fellegi and Sunter pioneered record linkage theory. Advances in methodology include use of an EM algorithm for parameter estimation, optimization of matches by means of a linear sum assignment program, and more recently, a probability model that addresses both m and u probabilities for all value states of a field. This provides a means for obtaining greater precision from non-uniformly distributed fields, without the theoretical complications arising from frequency-based matching alone. The model includes an iterative parameter estimation procedure that is more robust than pre-match estimation techniques. The methodology was originally developed and tested by the author at the U.S. Census Bureau for census undercount estimation. The more recent advances and a new generalized software system were tested and validated by linking highway crashes to Emergency Medical Service (EMS) reports and to hospital admission records for the National Highway Traffic Safety Administration (NHTSA).
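As an illustration of how m and u probabilities turn field agreement into a match score, the following is a minimal Python sketch of the standard Fellegi-Sunter log-likelihood-ratio weight. It is not the software system described in the abstract. The field names, the m and u values, and the helper names `field_weight` and `pair_weight` are hypothetical, and the sketch assumes that fields agree or disagree independently given match status and that a single m and u apply per field (not per value state).

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood ratio contributed by one field comparison.

    m = P(field agrees | pair is a true match)
    u = P(field agrees | pair is not a match), i.e. chance agreement
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))

def pair_weight(rec_a: dict, rec_b: dict, params: dict) -> float:
    """Sum the field weights for a record pair.

    Summing is valid only under the conditional-independence assumption
    stated in the lead-in.
    """
    total = 0.0
    for field, (m, u) in params.items():
        total += field_weight(rec_a[field] == rec_b[field], m, u)
    return total

# Hypothetical (m, u) values chosen for illustration only.
PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "sex": (0.98, 0.50),
}

# Made-up records standing in for a crash report and two EMS reports.
crash = {"surname": "GARCIA", "birth_year": 1962, "sex": "F"}
ems_same = {"surname": "GARCIA", "birth_year": 1962, "sex": "F"}
ems_other = {"surname": "NGUYEN", "birth_year": 1987, "sex": "F"}

score_same = pair_weight(crash, ems_same, PARAMS)
score_other = pair_weight(crash, ems_other, PARAMS)
```

With these assumed parameters, the pair that agrees on every field receives a positive total weight, and the pair that agrees only on sex receives a negative one. In practice the pair weights would be compared against upper and lower thresholds to classify pairs as links, non-links, or candidates for clerical review.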

MeSH terms

  • Adult
  • Algorithms
  • Child
  • Databases, Factual*
  • Humans
  • Mathematical Computing
  • Medical Record Linkage / methods*
  • Models, Statistical
  • Probability*
  • Public Health*
  • Software*