A late-binding, distributed, NoSQL warehouse for integrating patient data from clinical trials

Database (Oxford). 2019 Jan 1:2019:baz032. doi: 10.1093/database/baz032.

Abstract

Clinical trial data are typically collected through multiple systems developed by different vendors using different technologies and data standards. These data need to be integrated, standardized and transformed for a variety of monitoring and reporting purposes. The need to process large volumes of often inconsistent data in the presence of ever-changing requirements poses a significant technical challenge. As part of a comprehensive clinical data repository, we have developed a data warehouse that integrates patient data from any source, standardizes it and makes it accessible to study teams in a timely manner to support a wide range of analytic tasks for both in-flight and completed studies. Our solution combines Apache HBase, a NoSQL column store; Apache Phoenix, a massively parallel relational query engine; and a user-friendly interface. Together these facilitate efficient loading of large volumes of data under incomplete or ambiguous specifications, utilizing an extract-load-transform design pattern that defers data mapping until query time. This approach allows us to maintain a single copy of the data and transform it dynamically into any desirable format without requiring additional storage. Changes to the mapping specifications can be easily introduced and multiple representations of the data can be made available concurrently. Further, by versioning the data and the transformations separately, we can apply historical maps to current data or current maps to historical data, which simplifies the maintenance of data cuts and facilitates interim analyses for adaptive trials. The result is a highly scalable, secure and redundant solution that combines the flexibility of a NoSQL store with the robustness of a relational query engine to support a broad range of applications, including clinical data management, medical review, risk-based monitoring, safety signal detection, post hoc analysis of completed studies and many others.
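
The late-binding idea described above (load raw data once, version it, and apply a separately versioned map only at query time) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation and does not use HBase or Phoenix; the record layout, the field names (WT_LB, SEX_CD), the map contents and the functions snapshot and query are all hypothetical, chosen only to show how any data version can be paired with any map version without storing a transformed copy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RawRecord:
    """One source value stored exactly as loaded (hypothetical layout)."""
    patient_id: str
    source_field: str
    value: str
    data_version: int  # the load that introduced this value


# A map version: target column -> (source field, converter applied at query time).
MapSpec = Dict[str, Tuple[str, Callable[[str], Any]]]

# Single copy of the raw data; later loads add new versions, nothing is overwritten.
RAW_STORE: List[RawRecord] = [
    RawRecord("P001", "WT_LB", "154", 1),
    RawRecord("P001", "SEX_CD", "1", 1),
    RawRecord("P001", "WT_LB", "150", 2),  # value corrected in a later load
]

# Mapping specifications are versioned independently of the data.
MAP_VERSIONS: Dict[int, MapSpec] = {
    1: {"weight": ("WT_LB", float)},
    2: {
        "weight_kg": ("WT_LB", lambda v: round(float(v) * 0.4536, 1)),
        "sex": ("SEX_CD", lambda v: {"1": "M", "2": "F"}.get(v, "U")),
    },
}


def snapshot(as_of: int) -> Dict[Tuple[str, str], str]:
    """Return the latest raw value per (patient, field) at or before a data version."""
    latest: Dict[Tuple[str, str], str] = {}
    for rec in sorted(RAW_STORE, key=lambda r: r.data_version):
        if rec.data_version <= as_of:
            latest[(rec.patient_id, rec.source_field)] = rec.value
    return latest


def query(data_version: int, map_version: int) -> Dict[str, Dict[str, Any]]:
    """Apply the chosen map to the chosen data cut; nothing transformed is stored."""
    spec = MAP_VERSIONS[map_version]
    rows: Dict[str, Dict[str, Any]] = {}
    for (patient_id, field), value in snapshot(data_version).items():
        for target, (source, convert) in spec.items():
            if source == field:
                rows.setdefault(patient_id, {})[target] = convert(value)
    return rows


if __name__ == "__main__":
    print(query(data_version=2, map_version=1))  # historical map, current data
    print(query(data_version=1, map_version=2))  # current map, historical data
```

In this sketch, changing a mapping specification means adding an entry to MAP_VERSIONS; the raw store is untouched, and older representations remain available alongside newer ones.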

MeSH terms

  • Clinical Trials as Topic*
  • Data Warehousing*
  • Database Management Systems*
  • Humans
  • Machine Learning
  • User-Computer Interface