High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP)

Yichi Zhang; Tianrun Cai; Sheng Yu; Kelly Cho; Chuan Hong; Jiehuan Sun; Jie Huang; Yuk-Lam Ho; Ashwin N Ananthakrishnan; Zongqi Xia; Stanley Y Shaw; Vivian Gainer; Victor Castro; Nicholas Link; Jacqueline Honerlaw; Sicong Huang; David Gagnon; Elizabeth W Karlson; Robert M Plenge; Peter Szolovits; Guergana Savova; Susanne Churchill; Christopher O'Donnell; Shawn N Murphy; J Michael Gaziano; Isaac Kohane; Tianxi Cai; Katherine P Liao

doi:10.1038/s41596-019-0227-6

Nat Protoc. 2019 Dec;14(12):3426-3444. doi: 10.1038/s41596-019-0227-6. Epub 2019 Nov 20.

Authors

Yichi Zhang^#¹, Tianrun Cai^#², Sheng Yu^#^{3,4}, Kelly Cho^{5,6}, Chuan Hong¹, Jiehuan Sun¹, Jie Huang², Yuk-Lam Ho⁵, Ashwin N Ananthakrishnan⁷, Zongqi Xia⁸, Stanley Y Shaw⁹, Vivian Gainer¹⁰, Victor Castro¹⁰, Nicholas Link⁵, Jacqueline Honerlaw⁵, Sicong Huang², David Gagnon^{5,11}, Elizabeth W Karlson², Robert M Plenge², Peter Szolovits¹², Guergana Savova¹³, Susanne Churchill¹⁴, Christopher O'Donnell^{5,15}, Shawn N Murphy^{10,14,16}, J Michael Gaziano^{5,6}, Isaac Kohane¹⁴, Tianxi Cai^{1,14}, Katherine P Liao^{17,18,19}

Affiliations

¹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
² Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA.
³ Center for Statistical Science, Tsinghua University, Beijing, China.
⁴ Department of Industrial Engineering, Tsinghua University, Beijing, China.
⁵ Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA.
⁶ Division of Aging, Brigham and Women's Hospital, Boston, MA, USA.
⁷ Department of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA.
⁸ Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA.
⁹ Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA, USA.
¹⁰ Research Information Science and Computing, Partners Healthcare, Boston, MA, USA.
¹¹ Department of Biostatistics, Boston University, Boston, MA, USA.
¹² Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA.
¹³ Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA.
¹⁴ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
¹⁵ Division of Cardiology, VA Boston Healthcare System, Boston, MA, USA.
¹⁶ Department of Neurology, Massachusetts General Hospital, Boston, MA, USA.
¹⁷ Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, USA. kliao@bwh.harvard.edu.
¹⁸ Division of Data Sciences, VA Boston Healthcare System, Boston, MA, USA. kliao@bwh.harvard.edu.
¹⁹ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. kliao@bwh.harvard.edu.

^# Contributed equally.

Abstract

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold-standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 days if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms
Data Analysis*
Data Interpretation, Statistical
Electronic Health Records / statistics & numerical data*
High-Throughput Screening Assays / methods*
Humans
Machine Learning
Natural Language Processing
Phenotype
