PeerJ. 2014 Jun 3;2:e421.
doi: 10.7717/peerj.421. eCollection 2014.

BALSA: Integrated Secondary Analysis for Whole-Genome and Whole-Exome Sequencing, Accelerated by GPU

Free PMC article

Ruibang Luo et al. PeerJ. 2014.

Abstract

This paper reports an integrated solution, called BALSA, for the secondary analysis of next-generation sequencing data; it exploits the computational power of the GPU and intricate memory management to deliver fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole-genome sequencing data (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole-exome sequencing data. BALSA's speed is rooted in its parallel algorithms, which effectively exploit the GPU to speed up processes such as alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity compared with an ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and that it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the development of Apps for downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.
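As a rough back-of-the-envelope check on the whole-genome figures quoted in the abstract, the sketch below converts the stated read count and runtime into average depth and throughput. It assumes that "∼750 million" counts read pairs (1.5 billion 100 bp reads in total) and that the genome is about 3 Gbp; neither assumption is stated in the abstract, and the function names are purely illustrative.

```python
# Back-of-the-envelope arithmetic only; this is not part of BALSA.
# Assumptions (NOT stated in the abstract):
#   - "~750 million" counts read PAIRS, i.e. two 100 bp reads each
#   - the human genome is roughly 3.0e9 bp

def fold_coverage(read_pairs, read_len_bp, genome_bp):
    """Average depth: total sequenced bases divided by genome size."""
    return read_pairs * 2 * read_len_bp / genome_bp


def pairs_per_second(read_pairs, hours):
    """Mean end-to-end throughput over the whole run."""
    return read_pairs / (hours * 3600)


READ_PAIRS = 750e6   # from the abstract (interpreted as pairs)
READ_LEN_BP = 100    # from the abstract
GENOME_BP = 3.0e9    # assumed genome size
WGS_HOURS = 5.5      # from the abstract

coverage = fold_coverage(READ_PAIRS, READ_LEN_BP, GENOME_BP)
throughput = pairs_per_second(READ_PAIRS, WGS_HOURS)

print(f"coverage ~ {coverage:.0f}x")
print(f"throughput ~ {throughput:,.0f} read pairs/s")
```

Under these assumptions the read count is consistent with the stated 50-fold depth, and the 5.5 h runtime corresponds to roughly 38,000 read pairs per second averaged over all pipeline stages.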

Keywords: GPU; Genomics; HPC; NGS; Secondary analysis; Variant calling; Whole-exome sequencing; Whole-genome sequencing.

Figures

Figure 1. BALSA, based on SOAP3-dp, performs the whole secondary analysis (raw reads to variants) in memory, with most of the modules accelerated by GPU.
Figure 2. A flowchart of the BALSA pipeline. BQSR denotes “base quality score recalibration”.
Figure 3. Time consumption comparison between pipelines analyzing YH 50-fold 100 bp paired-end WGS data.
Figure 4. Correlation plot between the RandomForest Probability generated by BALSA and the VQSLOD value generated by GATK’s VQSR on YH 50-fold 100 bp paired-end WGS data.
Figure 5. Venn diagrams illustrating the overlaps between (1) BALSA, (2) the Ensemble call set, and (3) the known variants on both SNP and Indel. AAF denotes “alternative allele frequency”, i.e., the percentage of reads supporting the alternative allele among all simulated reads covering a variant. DP represents the number of simulated reads covering a variant. Qual means the variant score assigned by BALSA.
Figure 6. Venn diagrams illustrating the overlaps between (1) BALSA, (2) ISAAC, and (3) the known variants on both SNP and Indel.
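
The AAF and DP quantities defined in the Figure 5 caption are related by a simple ratio. The following is a minimal sketch of that definition, not BALSA's implementation; the function name and the example counts are hypothetical.

```python
# Illustrative only: AAF as defined in the Figure 5 caption, i.e. the
# percentage of reads supporting the alternative allele among all reads
# covering the variant (DP). Name and example values are hypothetical.

def alt_allele_frequency(alt_reads, depth):
    """Return AAF as a percentage of the covering reads (DP)."""
    if depth <= 0:
        raise ValueError("depth (DP) must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    return 100.0 * alt_reads / depth


# Example: 12 of 40 covering reads support the alternative allele.
aaf = alt_allele_frequency(alt_reads=12, depth=40)
print(f"AAF = {aaf:.1f}%")
```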

Grant support

This work was funded by Hong Kong GRF (General Research Fund) HKU-713512E and ITF (Innovation and Technology Fund) GHP/011/12. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
