Background: Health insurance claims (ie, receipts) record patient health care treatments and expenses and, although created for the health care payment system, are potentially useful for research. Combining the different types of receipts generated for the same patient would greatly increase their usefulness for research. However, technical problems, including standardization of disease names and classifications, and anonymous linkage of individual receipts, must first be addressed.
Methods: In collaboration with health insurance societies, we collected all information from inpatient, outpatient, and pharmacy receipts. To standardize disease names and classifications, we developed a computer-aided post-entry standardization method using a disease name dictionary based on International Classification of Diseases (ICD)-10 classifications. We also developed an anonymous linkage system that uses an encryption code generated from a combination of hash values and stream ciphers. Two sets of the original data were used to generate key codes (data set 1: insurance certificate number, name, and sex; data set 2: insurance certificate number, date of birth, and relationship status). We compared the percentage of successful record matches obtained when key codes were generated from data set 1 alone with the percentage obtained when both data sets were used.
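The anonymous linkage step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the hash function, the stream cipher, or the matching rule, so SHA-256, an HMAC-based toy keystream, the fallback rule in `same_person`, and all field names and the secret key below are assumptions made only for illustration.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret held by the linkage operator (assumption, demo value only).
SECRET_KEY = b"hypothetical-demo-key"


def normalize(value: str) -> str:
    """Unify character width and case so trivial variants hash identically."""
    return unicodedata.normalize("NFKC", value).strip().upper()


def field_hash(fields) -> bytes:
    """Hash the normalized identifying fields (SHA-256 is an assumption)."""
    joined = "\x1f".join(normalize(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).digest()


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (HMAC-SHA-256 over a counter); a stand-in for a real stream cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def key_code(fields, key: bytes = SECRET_KEY) -> str:
    """XOR the hash with the keystream so the stored code is neither the identifiers nor their bare hash."""
    digest = field_hash(fields)
    stream = keystream(key, len(digest))
    return bytes(a ^ b for a, b in zip(digest, stream)).hex()


def linkage_keys(rec: dict):
    """Key code 1 follows data set 1; key code 2 follows data set 2."""
    k1 = key_code([rec["cert_no"], rec["name"], rec["sex"]])
    k2 = key_code([rec["cert_no"], rec["birth_date"], rec["relationship"]])
    return k1, k2


def same_person(a: dict, b: dict, use_both: bool = True) -> bool:
    """Match on key code 1; optionally fall back to key code 2 (assumed rule)."""
    a1, a2 = linkage_keys(a)
    b1, b2 = linkage_keys(b)
    if a1 == b1:
        return True
    return use_both and a2 == b2
```

Under these assumptions, two receipts whose patient name was entered with different spellings fail to match on key code 1 but can still be linked through key code 2, which is one plausible reason why using both data sets would raise the match percentage.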
Results: The dictionary's automatic conversion of disease names successfully standardized 98.1% of approximately 2 million new receipts entered into the database. The percentage of anonymous matches was higher when both data sets were used (98.0%) than when data set 1 was used alone (88.5%).
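The dictionary-based conversion and its success percentage can be pictured with a short sketch. The dictionary entries, the normalization rule, and the function names are hypothetical; the actual dictionary and matching logic are not described in this abstract.

```python
import unicodedata

# Illustrative entries only: entered disease name -> ICD-10 category (assumed sample).
DISEASE_DICTIONARY = {
    "TYPE 2 DIABETES MELLITUS": "E11",
    "ESSENTIAL HYPERTENSION": "I10",
    "HIGH BLOOD PRESSURE": "I10",  # synonym mapped to the same category
    "ACUTE BRONCHITIS": "J20",
}


def normalize_name(name: str) -> str:
    """Unify character width, case, and spacing before lookup."""
    text = unicodedata.normalize("NFKC", name).upper()
    return " ".join(text.split())


def standardize(name: str):
    """Return the ICD-10 category, or None if the name needs manual review."""
    return DISEASE_DICTIONARY.get(normalize_name(name))


def conversion_rate(names):
    """Split names into coded and unresolved, and report the percentage converted automatically."""
    coded, unresolved = [], []
    for name in names:
        code = standardize(name)
        if code is None:
            unresolved.append(name)
        else:
            coded.append((name, code))
    percent = 100.0 * len(coded) / len(names) if names else 0.0
    return coded, unresolved, percent
```

In this sketch, names that the dictionary cannot resolve are set aside rather than guessed, so the reported percentage reflects only automatic conversions.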
Conclusions: The use of standardized disease classifications and anonymous record linkage substantially contributed to the construction of a large, chronologically organized database of receipts. This database is expected to aid in epidemiologic and health services research using receipt information.