Accuracy of Electronic Health Record Data for Identifying Stroke Cases in Large-Scale Epidemiological Studies: A Systematic Review from the UK Biobank Stroke Outcomes Group

PLoS One. 2015 Oct 23;10(10):e0140533. doi: 10.1371/journal.pone.0140533. eCollection 2015.


Objective: Long-term follow-up of population-based prospective studies is often achieved through linkages to coded regional or national health care data. Our knowledge of the accuracy of such data is incomplete. To inform methods for identifying stroke cases in UK Biobank (a prospective study of 503,000 UK adults recruited in middle-age), we systematically evaluated the accuracy of these data for stroke and its main pathological types (ischaemic stroke, intracerebral haemorrhage, subarachnoid haemorrhage), determining the optimum codes for case identification.

Methods: We sought studies published from 1990-November 2013, which compared coded data from death certificates, hospital admissions or primary care with a reference standard for stroke or its pathological types. We extracted information on a range of study characteristics and assessed study quality with the Quality Assessment of Diagnostic Studies tool (QUADAS-2). To assess accuracy, we extracted data on positive predictive values (PPV) and-where available-on sensitivity, specificity, and negative predictive values (NPV).

Results: 37 of 39 eligible studies assessed accuracy of International Classification of Diseases (ICD)-coded hospital or death certificate data. They varied widely in their settings, methods, reporting, quality, and in the choice and accuracy of codes. Although PPVs for stroke and its pathological types ranged from 6-97%, appropriately selected, stroke-specific codes (rather than broad cerebrovascular codes) consistently produced PPVs >70%, and in several studies >90%. The few studies with data on sensitivity, specificity and NPV showed higher sensitivity of hospital versus death certificate data for stroke, with specificity and NPV consistently >96%. Few studies assessed either primary care data or combinations of data sources.

Conclusions: Particular stroke-specific codes can yield high PPVs (>90%) for stroke/stroke types. Inclusion of primary care data and combining data sources should improve accuracy in large epidemiological studies, but there is limited published information about these strategies.

Publication types

  • Research Support, Non-U.S. Gov't
  • Review
  • Systematic Review

MeSH terms

  • Adult
  • Death Certificates
  • Electronic Health Records / standards*
  • Electronic Health Records / statistics & numerical data*
  • Hospitalization / statistics & numerical data
  • Humans
  • International Classification of Diseases / standards
  • International Classification of Diseases / statistics & numerical data
  • Middle Aged
  • Reproducibility of Results
  • Stroke / diagnosis*
  • Stroke / epidemiology*
  • United Kingdom / epidemiology