Data Quality in Electronic Health Record Research: An Approach for Validation and Quantitative Bias Analysis for Imperfectly Ascertained Health Outcomes Via Diagnostic Codes

Neal D Goldstein; Deborah Kahal; Karla Testa; Ed J Gracely; Igor Burstyn

doi:10.1162/99608f92.cbe67e91

Harv Data Sci Rev. 2022 Spring;4(2):10.1162/99608f92.cbe67e91. doi: 10.1162/99608f92.cbe67e91. Epub 2022 Apr 28.

Authors

Neal D Goldstein¹, Deborah Kahal², Karla Testa³, Ed J Gracely⁴, Igor Burstyn⁵

Affiliations

¹ Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America.
² William J. Holloway Community Program, ChristianaCare, Wilmington, Delaware, United States of America; Sydney Kimmel College of Medicine, Thomas Jefferson University, Philadelphia, Pennsylvania, United States of America.
³ Sydney Kimmel College of Medicine, Thomas Jefferson University, Philadelphia, Pennsylvania, United States of America; Westside Family Healthcare, Wilmington, Delaware, United States of America.
⁴ Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America; Department of Family, Community, and Preventive Medicine, College of Medicine, Drexel University, Philadelphia, Pennsylvania, United States of America.
⁵ Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America; Department of Environmental and Occupational Health, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America.

Abstract

It is incumbent upon all researchers who use the electronic health record (EHR), including data scientists, to understand the quality of such data. EHR data may be subject to measurement error or misclassification that has the potential to bias results unless one applies the available computational techniques specifically created for this problem. In this article, we begin with a discussion of data-quality issues in the EHR, focusing on health outcomes. We review the concepts of sensitivity, specificity, and positive and negative predictive values, and demonstrate how the imperfect classification of a dichotomous outcome variable can bias an analysis, both in terms of the prevalence of the outcome and the relative risk of the outcome under one treatment regime (i.e., exposure) compared to another. We then describe a generalizable approach to probabilistic (quantitative) bias analysis that combines regression estimation of the parameters relating the true and observed data with application of these estimates to recover the prevalence and relative risk that would have been observed had there been no misclassification. We describe bias analysis that accounts for both random and systematic errors and highlight its limitations. We then motivate a case study with the goal of validating the accuracy of a health outcome, chronic infection with hepatitis C virus, derived from a diagnostic code in the EHR. Finally, we demonstrate our approaches on the case study and conclude by summarizing the literature on outcome misclassification and quantitative bias analysis.
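
The effect of imperfect sensitivity and specificity on prevalence and relative risk can be illustrated with a short numerical sketch. The Python code below is not the regression-based probabilistic bias analysis described in the abstract; it uses only the standard deterministic relationship between true and observed prevalence (often called the Rogan-Gladen correction) and assumes nondifferential misclassification. All function names and numeric values are hypothetical and chosen for illustration only.

```python
def observed_prevalence(true_prev, se, sp):
    """Prevalence one would observe given true prevalence, sensitivity, specificity."""
    # True positives detected plus false positives among the truly negative.
    return se * true_prev + (1.0 - sp) * (1.0 - true_prev)


def corrected_prevalence(obs_prev, se, sp):
    """Back-calculate true prevalence from observed prevalence (Rogan-Gladen)."""
    denom = se + sp - 1.0
    if denom <= 0:
        raise ValueError("se + sp must exceed 1 for the correction to be defined")
    estimate = (obs_prev + sp - 1.0) / denom
    # Sampling error can push the estimate outside [0, 1]; truncate to the valid range.
    return min(1.0, max(0.0, estimate))


def predictive_values(true_prev, se, sp):
    """Positive and negative predictive values implied by prevalence, se, and sp."""
    tp = se * true_prev
    fp = (1.0 - sp) * (1.0 - true_prev)
    fn = (1.0 - se) * true_prev
    tn = sp * (1.0 - true_prev)
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical inputs: true risks in exposed and unexposed groups,
# and the accuracy of the diagnostic code (assumed equal in both groups).
risk_exposed, risk_unexposed = 0.10, 0.05
se, sp = 0.80, 0.95

true_rr = risk_exposed / risk_unexposed
obs_exposed = observed_prevalence(risk_exposed, se, sp)
obs_unexposed = observed_prevalence(risk_unexposed, se, sp)
obs_rr = obs_exposed / obs_unexposed  # attenuated toward 1 in this scenario

ppv, npv = predictive_values(risk_exposed, se, sp)
recovered = corrected_prevalence(obs_exposed, se, sp)

print(f"true RR = {true_rr:.2f}, observed RR = {obs_rr:.2f}")
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}, recovered risk = {recovered:.2f}")
```

In this hypothetical scenario the observed relative risk lies closer to 1 than the true relative risk, and applying the correction to the observed risk in the exposed group returns the true value. A probabilistic bias analysis would additionally treat sensitivity and specificity as uncertain quantities rather than fixed constants.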

Keywords: International Classification of Diseases; bias; data quality; electronic health record; hepatitis C; validation.

Grants and funding

K01 AI143356/AI/NIAID NIH HHS/United States