Federated sharing and processing of genomic datasets for tertiary data analysis

Brief Bioinform. 2021 May 20;22(3):bbaa091. doi: 10.1093/bib/bbaa091.

Abstract

Motivation: With the spread of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations face the need to share NGS data resources and to easily access and process comprehensively shared genomic data. In most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Building on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that open the centralized GMQL system to federated computing.

Results: We describe an extension of a centralized system architecture that supports federated data sharing and query processing. Data are federated through simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives the distribution of data and processing. The approach allows federated applications to be written in classical styles: centralized, distributed or externalized.
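
As a rough illustration of how evaluating an intermediate representation can drive data and processing distribution, the following minimal Python sketch walks a small operator graph and decides where each operation runs. It is not the GMQL implementation or its API: the `Op` class, the `assign_nodes` function, the dataset and node names, and the placement policy (an operation runs where its inputs live, and falls back to a default node when its inputs sit on different nodes) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Op:
    """One operation in a hypothetical intermediate representation (a DAG)."""
    name: str
    inputs: List["Op"] = field(default_factory=list)  # upstream operations
    dataset: Optional[str] = None  # set only for leaf (read) operations


def assign_nodes(
    root: Op, dataset_home: Dict[str, str], default_node: str
) -> Tuple[Dict[str, str], List[Tuple[str, str, str]]]:
    """Place every operation on a node and record the data transfers needed.

    Assumed policy (illustrative only): a read runs where its dataset is
    stored; any other operation runs where its inputs are if they share one
    node, otherwise on default_node.
    """
    placement: Dict[str, str] = {}
    transfers: List[Tuple[str, str, str]] = []  # (producer op, from, to)

    def visit(op: Op) -> str:
        if op.name in placement:  # shared sub-results are placed once
            return placement[op.name]
        if op.dataset is not None:
            node = dataset_home[op.dataset]
        else:
            child_nodes = [visit(child) for child in op.inputs]
            same_place = len(set(child_nodes)) == 1
            node = child_nodes[0] if same_place else default_node
            for child, child_node in zip(op.inputs, child_nodes):
                if child_node != node:
                    transfers.append((child.name, child_node, node))
        placement[op.name] = node
        return node

    visit(root)
    return placement, transfers


# Hypothetical federation: "genes" is stored at node A, "peaks" at node B.
genes = Op("read_genes", dataset="genes")
peaks = Op("read_peaks", dataset="peaks")
selected = Op("select_peaks", inputs=[peaks])
joined = Op("join", inputs=[genes, selected])

placement, transfers = assign_nodes(
    joined, dataset_home={"genes": "A", "peaks": "B"}, default_node="A"
)
print(placement)
print(transfers)
```

Under this assumed policy, the selection runs at the node that holds the data, so only its result needs to move to the node that performs the join.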

Availability: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/.

Contact: {arif.canakoglu, pietro.pinoli}@polimi.it.

Publication types

  • Research Support, Non-U.S. Gov't
  • Review

MeSH terms

  • Datasets as Topic*
  • Genomics*
  • High-Throughput Nucleotide Sequencing / methods
  • Humans
  • Information Dissemination*
  • Programming Languages