Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data

Marco Masseroli; Arif Canakoglu; Pietro Pinoli; Abdulrahman Kaitoua; Andrea Gulino; Olha Horlova; Luca Nanni; Anna Bernasconi; Stefano Perna; Eirini Stamoulakatou; Stefano Ceri

doi:10.1093/bioinformatics/bty688

Bioinformatics. 2019 Mar 1;35(5):729-736. doi: 10.1093/bioinformatics/bty688.

Affiliations

¹ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy.
² The German Research Center for Artificial Intelligence (DFKI), Berlin, Germany.

PMID: 30101316
DOI: 10.1093/bioinformatics/bty688

Abstract

Motivation: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance.

Results: The new system has a modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level, technology-independent repository abstraction supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for the Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work.
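As a rough illustration of point (ii), the idea of a technology-independent repository abstraction can be sketched in Python as below. This is not the GMQL system's actual API: the names `Repository`, `InMemoryRepository` and `count_regions`, and the representation of a region as a `(chromosome, start, stop)` tuple, are assumptions made only for this example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Hypothetical region representation: (chromosome, start, stop).
Region = Tuple[str, int, int]


class Repository(ABC):
    """Hypothetical storage-agnostic interface (not the real GMQL API)."""

    @abstractmethod
    def put(self, dataset: str, sample: str, regions: List[Region]) -> None:
        """Store the regions of one sample under a named dataset."""

    @abstractmethod
    def get(self, dataset: str) -> Dict[str, List[Region]]:
        """Return all samples of a dataset, keyed by sample name."""


class InMemoryRepository(Repository):
    """Toy backend; a file-system or database backend would
    implement the same two methods."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, List[Region]]] = {}

    def put(self, dataset: str, sample: str, regions: List[Region]) -> None:
        self._data.setdefault(dataset, {})[sample] = list(regions)

    def get(self, dataset: str) -> Dict[str, List[Region]]:
        return dict(self._data.get(dataset, {}))


def count_regions(repo: Repository, dataset: str) -> Dict[str, int]:
    """Client code depends only on the abstract interface,
    so it runs unchanged on any backend."""
    return {sample: len(regions) for sample, regions in repo.get(dataset).items()}


repo = InMemoryRepository()
repo.put("peaks", "sample_1", [("chr1", 100, 200), ("chr1", 300, 400)])
repo.put("peaks", "sample_2", [("chr2", 50, 80)])
counts = count_regions(repo, "peaks")
```

Because `count_regions` only sees the abstract interface, swapping the in-memory backend for another storage technology would not require changing it, which is the property the abstraction is meant to provide.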

Availability and implementation: The GMQL system is freely available for non-commercial use as an open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Epigenomics
Genome
Genomics
High-Throughput Nucleotide Sequencing*
Software*