Assumptions made when preparing drug exposure data for analysis have an impact on results: An unreported step in pharmacoepidemiology studies

Pharmacoepidemiol Drug Saf. 2018 Jul;27(7):781-788. doi: 10.1002/pds.4440. Epub 2018 Apr 17.


Purpose: Real-world data for observational research commonly require formatting and cleaning prior to analysis. Data preparation steps are rarely reported adequately and are likely to vary between research groups. This variation in methodology could affect study outcomes. This study aimed to develop a framework to define and document drug data preparation and to examine the impact of different assumptions on results.

Methods: An algorithm for processing prescription data was developed and tested using data from the Clinical Practice Research Datalink (CPRD). The impact of varying assumptions was examined by estimating the association between 2 exemplar medications (oral hypoglycaemic drugs and glucocorticoids) and cardiovascular events after preparing multiple datasets derived from the same source prescription data. Each dataset was analysed using Cox proportional hazards modelling.
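To illustrate how one source prescription record can yield different exposure periods under different preparation assumptions, here is a minimal sketch. The field names (`quantity`, `daily_dose`, `duration_days`), the two end-date rules, and the example values are hypothetical and chosen for illustration only; they are not taken from the study's algorithm or from CPRD.

```python
from datetime import date, timedelta


def end_date_from_quantity(start, quantity, daily_dose):
    """Assumption A (hypothetical): supply lasts quantity / daily_dose days."""
    days = round(quantity / daily_dose)
    return start + timedelta(days=days)


def end_date_from_duration(start, duration_days):
    """Assumption B (hypothetical): trust the recorded duration field."""
    return start + timedelta(days=duration_days)


# A single made-up prescription record; all values are illustrative.
rx = {
    "start": date(2010, 1, 1),
    "quantity": 56,        # tablets supplied
    "daily_dose": 2,       # tablets per day
    "duration_days": 30,   # duration as recorded by the prescriber
}

end_a = end_date_from_quantity(rx["start"], rx["quantity"], rx["daily_dose"])
end_b = end_date_from_duration(rx["start"], rx["duration_days"])

# Exposure length in days under each assumption.
print((end_a - rx["start"]).days, (end_b - rx["start"]).days)
```

In this made-up record the two rules disagree by two days. Applied across many prescriptions, differences of this kind change how much follow-up time is classified as exposed, which is the mechanism by which a preparation choice can feed through to a hazard ratio.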

Results: The algorithm included 10 decision nodes and 54 possible unique assumptions. Over 11 000 possible pathways through the algorithm were identified. In both exemplar studies, similar hazard ratios and standard errors were found for the majority of pathways; however, certain assumptions had a greater influence on results. For example, in the hypoglycaemic analysis, choosing a different variable to define prescription end date altered the hazard ratios (95% confidence intervals) from 1.77 (1.56-2.00) to 2.83 (1.59-5.04).
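The way a modest number of decision nodes expands into a large number of pathways can be sketched as follows. The node names and options below are hypothetical, and the sketch assumes that every option at one node can be combined with every option at the others; the abstract does not state the options per node or whether all combinations are valid.

```python
from itertools import product
from math import prod

# Hypothetical decision nodes, each with its candidate assumptions.
decision_nodes = {
    "missing_quantity": ["drop", "impute_median"],
    "end_date_variable": ["from_quantity", "from_duration", "fixed_length"],
    "overlapping_prescriptions": ["allow", "shift_forward"],
}

# One pathway = one choice at every node (Cartesian product of the options).
pathways = [
    dict(zip(decision_nodes, choice))
    for choice in product(*decision_nodes.values())
]

# Under the full-combination assumption, the count is the product of the
# number of options at each node.
n_expected = prod(len(options) for options in decision_nodes.values())

print(len(pathways), n_expected)
```

Each dictionary in `pathways` could then be used to prepare one dataset from the same source prescriptions, so that every dataset is analysed with the same model and differences in the estimates are attributable to the preparation choices.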

Conclusions: The framework offers a transparent and efficient way to perform and report drug data preparation steps. Assumptions made during data preparation can impact the results of analyses. Improving transparency regarding drug data preparation would increase the repeatability, reproducibility, and comparability of published results.

Keywords: data preparation; pharmacoepidemiology; reproducibility; transparency.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Arthritis, Rheumatoid / drug therapy
  • Cardiovascular Diseases / chemically induced
  • Data Interpretation, Statistical
  • Databases, Factual
  • Diabetes Mellitus, Type 2 / drug therapy
  • Glucocorticoids / adverse effects*
  • Glucocorticoids / therapeutic use*
  • Humans
  • Hypoglycemic Agents / adverse effects*
  • Hypoglycemic Agents / therapeutic use*
  • Pharmacoepidemiology / methods*
  • Reproducibility of Results
  • Research Design
  • Risk Factors

Substances

  • Glucocorticoids
  • Hypoglycemic Agents