Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data

Bioinformatics. 2019 Mar 1;35(5):729-736. doi: 10.1093/bioinformatics/bty688.

Abstract

Motivation: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.

Results: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.

Availability and implementation: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Epigenomics
  • Genome
  • Genomics
  • High-Throughput Nucleotide Sequencing*
  • Software*