Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system

BMC Med Inform Decis Mak. 2006 Jul 26;6:30. doi: 10.1186/1472-6947-6-30.


Background: The text descriptions in electronic medical records are a rich source of information. We have developed a Health Information Text Extraction (HITEx) tool and used it to extract key findings for a research study on airways disease.

Methods: The principal diagnosis, co-morbidity and smoking status extracted by HITEx from a set of 150 discharge summaries were compared to an expert-generated gold standard.

Results: The accuracy of HITEx was 82% for principal diagnosis, 87% for co-morbidity, and 90% for smoking status extraction, when cases labeled "Insufficient Data" by the gold standard were excluded.
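
The accuracy figures above are computed only over cases for which the gold standard had enough information to assign a label. A minimal sketch of that calculation follows. The abstract does not describe the scoring code, so the function name, the smoking-status label strings other than "Insufficient Data", and the toy data are all assumptions made for illustration.

```python
from typing import Sequence

# Label quoted in the abstract for gold-standard cases that are excluded from scoring.
INSUFFICIENT = "Insufficient Data"


def accuracy_excluding(
    gold: Sequence[str],
    predicted: Sequence[str],
    excluded_label: str = INSUFFICIENT,
) -> float:
    """Fraction of exact label matches, ignoring cases whose gold label is excluded.

    Hypothetical helper: exact string match is assumed as the agreement criterion.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Keep only the cases the gold standard could actually label.
    scored = [(g, p) for g, p in zip(gold, predicted) if g != excluded_label]
    if not scored:
        raise ValueError("no cases left to score after exclusion")

    correct = sum(1 for g, p in scored if g == p)
    return correct / len(scored)


# Toy labels for illustration only; these are NOT data from the study.
gold_labels = [
    "Current Smoker",
    "Never Smoked",
    "Insufficient Data",
    "Past Smoker",
    "Never Smoked",
]
system_labels = [
    "Current Smoker",
    "Never Smoked",
    "Never Smoked",
    "Current Smoker",
    "Never Smoked",
]

smoking_accuracy = accuracy_excluding(gold_labels, system_labels)
print(f"accuracy over scored cases: {smoking_accuracy:.2f}")
```

In this toy example one of the five cases is dropped because its gold label is "Insufficient Data", so accuracy is taken over the remaining four. Excluding such cases changes the denominator, which is why the abstract states the condition alongside the reported percentages.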

Conclusion: We consider the results promising, given the complexity of the discharge summaries and the extraction tasks.

Publication types

  • Evaluation Study
  • Research Support, N.I.H., Extramural

MeSH terms

  • Asthma / complications
  • Asthma / diagnosis*
  • Comorbidity
  • Humans
  • International Classification of Diseases
  • Medical Records Systems, Computerized / standards*
  • Natural Language Processing*
  • Patient Discharge*
  • Pulmonary Disease, Chronic Obstructive / complications
  • Pulmonary Disease, Chronic Obstructive / diagnosis
  • Sensitivity and Specificity
  • Smoking / epidemiology