SeqWare Query Engine: storing and searching sequence data in the cloud
- PMID: 21210981
- PMCID: PMC3040528
- DOI: 10.1186/1471-2105-11-S12-S2
Abstract
Background: Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands.
Results: In this work, we present the SeqWare Query Engine, which was created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and an interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations, including coverage and functional consequences. As a proof of concept, we loaded several whole genome datasets, including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair both to profile performance and to provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net).
Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.
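To make the idea of querying variants in a sorted key-value store more concrete, the following is a minimal in-memory sketch in Python. It is not the SeqWare Query Engine's code or schema: the `Variant` fields, the `row_key` layout (contig name plus zero-padded position), and the `VariantStore` class are all hypothetical, chosen only to illustrate how lexicographically sorted row keys, as used by stores such as HBase, allow a genomic region to be fetched with a single range scan and then filtered by an annotation such as coverage.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout, not the SeqWare schema.
    chrom: str
    pos: int
    kind: str            # e.g. "SNV", "indel", "translocation"
    coverage: int
    consequence: str = ""


def row_key(chrom: str, pos: int) -> str:
    # Zero-padding makes lexicographic order match numeric order
    # within a contig (assumed key layout, for illustration only).
    return f"{chrom}:{pos:012d}"


class VariantStore:
    """Toy stand-in for a sorted key-value table (one variant per position)."""

    def __init__(self) -> None:
        self._keys: list[str] = []            # kept sorted
        self._rows: dict[str, Variant] = {}

    def put(self, variant: Variant) -> None:
        key = row_key(variant.chrom, variant.pos)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = variant

    def scan(self, chrom: str, start: int, stop: int,
             min_coverage: int = 0) -> list[Variant]:
        # Range scan over [start, stop] on one contig, then filter
        # the returned rows by the coverage annotation.
        lo = bisect.bisect_left(self._keys, row_key(chrom, start))
        hi = bisect.bisect_right(self._keys, row_key(chrom, stop))
        rows = (self._rows[k] for k in self._keys[lo:hi])
        return [v for v in rows if v.coverage >= min_coverage]


store = VariantStore()
store.put(Variant("chr1", 1500, "SNV", 30, "missense"))
store.put(Variant("chr1", 900, "indel", 12))
store.put(Variant("chr10", 1200, "SNV", 40))
store.put(Variant("chr1", 2500, "SNV", 8))

hits = store.scan("chr1", 1000, 3000, min_coverage=10)
print([(v.pos, v.kind) for v in hits])  # [(1500, 'SNV')]
```

In this sketch the scan touches only keys that fall inside the requested region, which is the property that makes position-keyed tables attractive for region queries; a real deployment would also need to handle multiple variants per position and multiple genomes.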