Clin Epidemiol. 2018 Aug 10;10:961-970.
doi: 10.2147/CLEP.S170075. eCollection 2018.

CDEGenerator: An Online Platform to Learn From Existing Data Models to Build Model Registries

Free PMC article

Julian Varghese et al. Clin Epidemiol.


Objective: Best-practice data models harmonize the semantics and data structure of medical variables in clinical or epidemiological studies. Although several published data sets exist, it remains challenging to find and reuse published eligibility criteria or other data items that match the specific needs of a newly planned study or registry. A novel Internet-based method for rapid comparison of published data models was implemented to enable reuse, customization, and harmonization of item catalogs for the early planning and development phase of research databases.

Methods: Based on prior work, a European information infrastructure with a large collection of medical data models was established. A newly developed analysis module called CDEGenerator provides systematic comparison of selected data models and user-tailored creation of minimum data sets or harmonized item catalogs. Usability was assessed by eight external medical documentation experts in a workshop by the umbrella organization for networked medical research in Germany with the System Usability Scale.

Results: The analysis and item-tailoring module provides multilingual comparisons of semantically complex eligibility criteria of clinical trials. The System Usability Scale yielded "good usability" (mean 75.0, range 65.0-92.5). User-tailored models can be exported to several data formats, such as XLS, REDCap or Operational Data Model by the Clinical Data Interchange Standards Consortium, which is supported by the US Food and Drug Administration and European Medicines Agency for metadata exchange of clinical studies.

Conclusion: The online tool provides user-friendly methods to reuse, compare, and thus learn from data items of standardized or published models to design a blueprint for a harmonized research database.

Keywords: Unified Medical Language System; common data elements; metadata repositories; semantic interoperability.

Conflict of interest statement

Disclosure: The authors report no conflicts of interest in this work.


Figure 1
An online platform to share, analyze, and reuse medical data models. Notes: Raw material from original sources is processed into a standardized data-model format (Clinical Data Interchange Standards Consortium operational data model) and enriched with language-independent semantic codes by the content-development team before uploading to the Internet-based platform. This provides open access (via the Medical Data Models Portal), advanced semantic comparison, and generation of user-tailored item catalogs (by CDEGenerator) that serve as blueprints for harmonized research databases.
Figure 2
Screenshot of CDEGenerator: top medical concepts. Notes: Image shows the ten most frequent medical concepts of eligibility criteria of five different diabetes mellitus type 2 studies, which are identified by their NCT numbers. The most frequent concept, “diabetes mellitus, non-insulin-dependent”, occurred in all five studies (indicated by the # All column), since its diagnosis was required for study inclusion. The second-most frequent concept, “glycosylated hemoglobin A”, is expanded in this image: the first original item question consists of multiple lines of text. CDEGenerator was able to decompose this text to the two medically relevant concepts “diabetes mellitus, non-insulin-dependent” (assigned to the top concept) and “glycosylated hemoglobin A” (current expanded concept). All the listed data types are Boolean (meaning that the answer to that item is true or false), because each eligibility criterion is either fulfilled or not.
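The ranking described in this caption can be illustrated with a minimal sketch. It assumes each eligibility criterion has already been decomposed into a list of concept names; the function name `top_concepts`, the study identifiers, and the input layout are hypothetical and not taken from CDEGenerator itself.

```python
from collections import defaultdict


def top_concepts(criteria_by_study, n=10):
    """Rank concepts by the number of studies in which they occur.

    `criteria_by_study` maps a study id to a list of items, where each item
    is the list of concepts found in one eligibility criterion (assumed
    input layout). Returns up to `n` (concept, study_count) pairs.
    """
    studies_per_concept = defaultdict(set)
    for study, items in criteria_by_study.items():
        for concepts in items:
            for concept in concepts:
                studies_per_concept[concept].add(study)
    # Sort by descending study count, then alphabetically for stable output.
    ranked = sorted(studies_per_concept.items(),
                    key=lambda pair: (-len(pair[1]), pair[0]))
    return [(concept, len(studies)) for concept, studies in ranked[:n]]


# Hypothetical example: one criterion mentions two concepts at once.
example = {
    "study_A": [["diabetes mellitus, non-insulin-dependent",
                 "glycosylated hemoglobin A"]],
    "study_B": [["diabetes mellitus, non-insulin-dependent"]],
}
print(top_concepts(example))
```

The study count per concept plays the role of the "# All" column in the screenshot.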
Figure 3
Screenshot of CDEGenerator: data item details. Notes: If an item contains a coded list (eg, classifications) with defined permissible values, it can be expanded further to view the permissible values. If the user chooses to add an item to the cart (“Add to cart” checkbox), full item details (UMLS coding, question, data type, and code list) will be included in a resulting item catalog, which can later be downloaded in various platform-independent formats to build a research database. Abbreviation: NYHA, New York Heart Association.
Figure 4
Cumulative coverage plot and similarity matrix of selected sources. Notes: (A) By hovering along the x-axis, the user can choose the set size of the most frequent concepts and immediately see their coverage of all concept occurrences. For instance, the 13 most frequent concepts cover 28% of all 108 concept occurrences and the 20 most frequent concepts cover 35%. Those concepts can be viewed in detail within the concept list (see Figures 2 and 3). The dashed diagonal line shows the graph that would result if every concept occurred only once, and thus has a constant slope. Therefore, an initially high deviation of the actual graph (blue solid line) from the dashed line indicates the existence of highly repetitive concepts within the sources. (B) Each cell contains the number of concepts common to two sources. Two additional numbers give percentages that represent the relative overlap between source 1 and source 2. For instance, the second cell provides concept overlaps between eligibility criteria of studies NCT00592527 and NCT00641251. There are three common concepts, which can be reviewed in detail (see Figures 2 and 3) by clicking the cell. Since the first study contains only 12 concepts and the second 17, the relative overlap is higher in the first study (25.0% vs 17.6%). The redder a cell is, the higher the first percentage value, which indicates the overlap of source 1 in source 2. Blue font indicates the number of common concepts for each cell.
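The two quantities in this caption can be reproduced with a short sketch. It assumes each source is represented as a set of concept codes and that a concept counts once per source; the function names and the placeholder codes are hypothetical, and only the set sizes (12, 17, and 3 shared concepts) come from the caption.

```python
from collections import Counter
from itertools import accumulate


def cumulative_coverage(sources):
    """Share of all concept occurrences covered by the k most frequent concepts.

    `sources` maps a source id to the set of concept codes found in it
    (assumption: a concept counts once per source). Entry k-1 of the
    returned list is the coverage reached by the top k concepts.
    """
    counts = Counter(code for concepts in sources.values() for code in concepts)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [running / total for running in accumulate(ordered)]


def pairwise_overlap(first, second):
    """Return (common concepts, % of first covered, % of second covered)."""
    common = len(first & second)
    return common, 100 * common / len(first), 100 * common / len(second)


# Placeholder codes: 12 concepts in one study, 17 in the other, 3 shared.
study_1 = {f"C{i:02d}" for i in range(12)}
study_2 = {f"C{i:02d}" for i in range(9, 26)}
common, pct_1, pct_2 = pairwise_overlap(study_1, study_2)
print(common, round(pct_1, 1), round(pct_2, 1))  # 3 25.0 17.6
```

If every concept occurred only once, `cumulative_coverage` would rise in equal steps, which corresponds to the dashed diagonal in panel A.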



