Cell. 2018 Jun 14;173(7):1692-1704.e11.
doi: 10.1016/j.cell.2018.04.032. Epub 2018 May 17.

Disease Heritability Inferred From Familial Relationships Reported in Medical Records

Free PMC article

Fernanda C G Polubriaginof et al. Cell.


Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.
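
One way to picture how a small set of reported next-of-kin links can yield many more relationships is to compose them. The sketch below is illustrative only and is not the authors' algorithm: the function name `infer_relationships`, the input format of (parent, child) pairs, and the two composition rules are assumptions made for this example.

```python
from collections import defaultdict

def infer_relationships(parent_of):
    """Derive sibling and grandparent pairs from reported (parent, child) pairs.

    Hypothetical helper for illustration; the rule set is an assumption.
    """
    children = defaultdict(set)
    for parent, child in parent_of:
        children[parent].add(child)

    inferred = set()
    for parent, kids in children.items():
        # Two people who share a reported parent are inferred to be siblings
        # (full or half; a single shared parent cannot distinguish the two).
        for a in kids:
            for b in kids:
                if a < b:
                    inferred.add((a, "sibling", b))
        # The parent of a parent is inferred to be a grandparent.
        for kid in kids:
            for grandchild in children.get(kid, ()):
                inferred.add((parent, "grandparent", grandchild))
    return inferred

# Toy input: ann is the reported parent of bob and cat; bob is the parent of dan.
reported = [("ann", "bob"), ("ann", "cat"), ("bob", "dan")]
inferred = infer_relationships(reported)
print(sorted(inferred))
```

Three reported links yield two further relationships here; applied across hundreds of thousands of records, this kind of composition is one plausible source of the growth in relationship counts.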

Keywords: data mining; disease heritability; electronic health record; familial relationships; family history; genetics; observational databases.

Conflict of interest statement

Declaration of Interests

The authors declare no competing interests.


Fig. 1. Inference of familial relationships and estimation of heritability from the electronic health records
At Columbia, 680,000 reported next-of-kin relationships were identified in the institutional EHR; 430,000 and 780,000 were identified at Weill Cornell and Mount Sinai, respectively. From these initial relationships, we inferred additional relationships, resulting in 3.2 million patient relationships at Columbia, 1.5 million at Weill Cornell, and 2.6 million at Mount Sinai. A family was defined as a group of patients with no relationships outside of the group. In total, we identified 223,000 families at Columbia, 155,000 at Weill Cornell, and 187,000 at Mount Sinai. The largest 400 families from Columbia were visualized as a graph using a force layout (Materials and Methods). Each disconnected subgraph is a family. Each node is an individual. Solid nodes represent patients in our respective EHRs. Colored nodes indicate the presence of a disease diagnosis in one of four classes: cardiovascular disease (red), musculoskeletal disease (purple), metabolic disease (blue), and skin disease (green). The top left shows 93 of the top families at Columbia. The largest family shown contains 23 individuals and the smallest, 12. We constructed a detailed pedigree for one family from Columbia (bottom left). The pedigree shown was modified for de-identification purposes. Each node is an individual. Individuals indicated by dashed lines are inferred to exist but are not present in the EHR. The top right shows a map of the number of individuals from Columbia for whom relationships were identified. The colors represent the number of individuals who live in each ZIP code. The bottom right shows a bar graph of the number of individuals by relationship type for each institution. We used all disease diagnosis data and clinical pathology report data (laboratory tests) available for patients in our cohort to study genetic heritability.
At Columbia, 6.6 million disease diagnoses were used to estimate heritability of dichotomous traits and 42 million laboratory tests were used to estimate heritability of quantitative traits. At Weill Cornell, 3 million disease diagnoses and 16 million laboratory tests were used, and at Mount Sinai, 4 million disease diagnoses.
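
The caption's definition of a family, a group of patients with no relationships outside the group, corresponds to a connected component of the relationship graph. A minimal sketch of that grouping follows, assuming relationships arrive as pairs of patient identifiers; the function name `find_families` and the union-find approach are choices made for this example, not details taken from the paper.

```python
def find_families(relationships):
    """Group individuals into families (connected components) with union-find."""
    parent = {}

    def find(x):
        # Register unseen individuals as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in relationships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    families = {}
    for person in list(parent):
        families.setdefault(find(person), set()).add(person)
    # Largest families first, mirroring the "largest 400 families" selection.
    return sorted(families.values(), key=len, reverse=True)

# Toy input with hypothetical patient identifiers.
pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
families = find_families(pairs)
print(families)
```

Each returned set is one disconnected subgraph in the sense used by the figure.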
Fig. 2. Validation of familial relationships inferred from the EHR
(A) The medical centers at both Columbia and Weill Cornell have implemented a link between the electronic health records of mother and baby at the time of birth. We used these links as a gold standard to evaluate RIFTEHR, our algorithm for automatically inferring relationships from the EHR. We also inferred siblings using the mother-baby link data. (B) Through biobanks at Columbia, 302 of the patients with identified relationships from RIFTEHR also had genetic data that were available and appropriately consented for use in our study. For these, RIFTEHR predicted a total of 172 relationships. Genetic relatedness was determined for each pair of individuals. Almost all 134 parent/child relationships had the expected genetic relatedness of 50% (51%±3%). Of the siblings predicted by RIFTEHR, 19 were full siblings, 3 were half siblings (genetic relatedness of 25%), and 4 were identical twins. The high rate of twins in our small sample is a result of the secondary use of existing data, which were originally collected for genetic studies. Excluding these twins yields a more accurate estimate of RIFTEHR’s performance (PPV=86.4%). Overall, the RIFTEHR relationship and the genetic relationship were significantly correlated (r=0.60, p=1.81e-18). (C) Average age differences for each relationship type. We computed the age differences for each pair of individuals at Columbia (blue), Weill Cornell (red), and Mount Sinai (purple). The age differences are consistent across sites. (D) At Mount Sinai, we identified 1,222 patients who had familial relationships from RIFTEHR and also had genetic data available with appropriate consent for use in our study. Among these, RIFTEHR inferred 937 relationships. Genetic relatedness was determined for each pair of individuals and compared to the relationships inferred by RIFTEHR. RIFTEHR’s performance varied from 32% to 91% PPV and was more accurate in identifying members of the nuclear family.
Overall, the RIFTEHR relationship and the genetic relationship were significantly correlated (r=0.67, p<1.2e-162).
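
The positive predictive value (PPV) quoted in panel (B) can be reproduced from the sibling counts in the caption under one reading: full siblings count as correct predictions, half siblings as incorrect, and the identical twins are left out. That reading is an assumption made here because it matches the reported figure; the caption does not spell out the calculation.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

# Sibling counts reported in panel (B).
full_siblings = 19    # assumed to count as correct sibling predictions
half_siblings = 3     # assumed to count as incorrect sibling predictions
identical_twins = 4   # excluded, as described in the caption

ppv = positive_predictive_value(full_siblings, half_siblings)
print(f"{ppv:.1%}")  # 19 / 22, which rounds to the 86.4% in the caption
```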
Fig. 3. Validation of SOLARStrap accuracy and robustness using simulated data
(A) Traits with heritability ranging from 5% to 95% were generated using SOLAR. We used actual family structures extracted from the EHR by RIFTEHR to generate the simulated traits. We then created dichotomous (binary) versions of each trait by choosing a threshold that would yield a trait with 15% prevalence. SOLAR was very accurate at recapitulating the correct heritability for both quantitative (r^2 = 0.999) and binary (r^2 = 0.994) traits. In (B), (C), and (D), the number of families varied from 100 to 1000 and is represented by different colors. (B) SOLARStrap was run on each of the simulated quantitative traits and was accurate at estimating the true heritability (r^2 = 0.986). SOLARStrap was accurate regardless of the number of families used in the sampling procedure (left). (C) SOLARStrap was run on each of the binary traits in the setting of complete ascertainment. SOLARStrap achieved accuracy similar to that in the quantitative case (r^2 = 0.988). (D) SOLARStrap was run on each of the binary traits in the setting of incomplete ascertainment. In this case, families without any cases were dropped and a proband was randomly assigned in each family. The accuracy is lower than in the case of complete ascertainment (r^2 = 0.930). (E) In the presence of randomly missing information, both SOLAR and SOLARStrap produce accurate estimates of the true heritability even when up to 60% of the data are removed. However, in four cases, where the proportion removed was 35%, 45%, and above 50%, SOLARStrap estimates did not pass our internal quality control criteria. (F) SOLAR is sensitive to this bias and produces inaccurate results as the strength of the bias increases. SOLARStrap is robust to these biases and produces accurate estimates of heritability even in the most extreme case of bias.
(G) As the number of families sampled increases toward the total number of available families, SOLARStrap becomes more sensitive to bias; in the most extreme case, where the number of sampled families equals the total number of available families, SOLARStrap reduces to simply running SOLAR. (H) The estimate of heritability is not dependent on the number of families sampled (r=0.02, p=4.1e-8). (I) The Proportion of Significant Attempts (POSA) is a primary estimate of quality for heritability estimates produced by SOLARStrap. The accuracy of SOLARStrap increases as the POSA increases (shown as error here). (J) The effect of noise injection on the estimate of observational heritability of rhinitis. We injected noise into the data by randomly shuffling a subset of the patient diagnoses. This simulates misclassification (misdiagnosis or missed diagnosis) in the medical records. When no noise is injected, the estimate is 0.77 (0.60–0.92). As noise is introduced, the estimate of the heritability decreases, reaching 0.36 (0.23–0.49) once one quarter of the data are randomized.
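
Two of the data manipulations described in this caption, thresholding a quantitative trait to 15% prevalence (panel A) and shuffling a subset of diagnoses to simulate misclassification (panel J), are simple enough to sketch. The code below is a rough illustration on synthetic values, not the authors' pipeline: the function names, the independent Gaussian trait, and the exact shuffling scheme are assumptions.

```python
import random

def dichotomize(values, prevalence=0.15):
    """Label the top `prevalence` fraction of trait values as cases (1)."""
    n_cases = round(len(values) * prevalence)
    # Indices ordered from the largest to the smallest trait value.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    cases = set(order[:n_cases])
    return [1 if i in cases else 0 for i in range(len(values))]

def inject_noise(labels, fraction, rng):
    """Shuffle the labels of a random subset of patients among themselves."""
    noisy = list(labels)
    chosen = rng.sample(range(len(noisy)), round(len(noisy) * fraction))
    shuffled = [noisy[i] for i in chosen]
    rng.shuffle(shuffled)
    for i, value in zip(chosen, shuffled):
        noisy[i] = value
    return noisy

rng = random.Random(0)
# Synthetic stand-in for a simulated quantitative trait (no family structure).
trait = [rng.gauss(0.0, 1.0) for _ in range(1000)]
binary = dichotomize(trait)
noisy = inject_noise(binary, fraction=0.25, rng=rng)

print(sum(binary) / len(binary))   # prevalence of the binary trait
print(sum(noisy) == sum(binary))   # shuffling preserves overall prevalence
```

Because shuffling only permutes labels within the chosen subset, overall prevalence is unchanged while the link between diagnosis and family membership is weakened, which is consistent with the drop in estimated heritability reported in panel (J).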
Fig. 4. Estimating heritability of disease using electronic health records
We designed a method, called SOLARStrap, for estimating the observational heritability (h_o^2), that is, the heritability of traits whose phenotypes are derived under unknown ascertainment biases. (A) We found that performance was consistent across sites and (B) that h_o^2 is significantly correlated with literature estimates of h^2. (C) Heritability estimates stratified by race and ethnicity using the AE model are correlated with estimates of h_o^2. (D) These models are also correlated when computing heritability estimates for ICD10 codes alone. (E) Heritability estimates for previously studied traits, such as height, were recapitulated by our study. We also stratified heritability of height by self-reported race and ethnicity as available in the EHR. (F) Observational heritability of HDL cholesterol (blue) is significantly higher than heritability of LDL cholesterol (red). This difference is still observed after stratifying patients by the presence or absence of HMG-CoA reductase inhibitors as treatment for hypercholesterolemia.
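
The Fig. 3 caption characterizes SOLARStrap as repeatedly sampling a fixed number of families and tracking the Proportion of Significant Attempts (POSA). The loop below sketches that general shape only. Everything specific in it is an assumption: `toy_estimate` is a placeholder and does not call SOLAR, the 0.05 significance threshold is arbitrary, and summarizing the significant attempts by their median is a choice made for this example.

```python
import random
import statistics

def subsample_heritability(families, estimate, n_sampled, n_attempts,
                           alpha=0.05, seed=0):
    """Repeatedly estimate heritability on random subsets of families.

    `estimate` is any callable returning (h2, p_value) for a list of families.
    Returns (summary of significant estimates, POSA).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_attempts):
        sample = rng.sample(families, n_sampled)  # sampling without replacement
        h2, p_value = estimate(sample)
        if p_value < alpha:  # assumed significance threshold
            significant.append(h2)
    posa = len(significant) / n_attempts
    # Median is an assumed summary statistic; None when nothing was significant.
    summary = statistics.median(significant) if significant else None
    return summary, posa

def toy_estimate(sample):
    """Placeholder estimator for illustration only; NOT a heritability model."""
    h2 = sum(family["signal"] for family in sample) / len(sample)
    p_value = 0.01 if h2 > 0.3 else 0.5
    return h2, p_value

toy_rng = random.Random(1)
toy_families = [{"signal": toy_rng.uniform(0.0, 1.0)} for _ in range(200)]
h2_hat, posa = subsample_heritability(
    toy_families, toy_estimate, n_sampled=50, n_attempts=100
)
print(h2_hat, posa)
```

In this framing, setting `n_sampled` equal to the total number of families makes every attempt identical, which mirrors the caption's remark that SOLARStrap then reduces to a single SOLAR run.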
