Bioinformatics. 2019 Feb 1;35(3):421-432.
doi: 10.1093/bioinformatics/bty648.

Scaling read aligners to hundreds of threads on general-purpose processors

Ben Langmead et al. Bioinformatics.

Abstract

Motivation: General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners.

Results: We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.

Availability and implementation: Experiments for this study: https://github.com/BenLangmead/bowtie-scaling.

Bowtie: http://bowtie-bio.sourceforge.net.

Bowtie 2: http://bowtie-bio.sourceforge.net/bowtie2.

HISAT: http://www.ccb.jhu.edu/software/hisat.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Four threads running simultaneously in an embarrassingly parallel setting. Time progresses from top to bottom. Gray boxes show time spent waiting to enter the critical section. Black boxes show time spent in the critical section, which can be occupied by at most one thread at a time. At time t1 (dashed line), thread 1 is executing the critical section and all the other threads are running. At time t2, thread 1 is still in the critical section and threads 2 and 3 are waiting to enter. At time t3, thread 2 occupies the critical section and thread 4 is waiting.
Fig. 2.
Converting a standard pair of FASTQ files (a) to blocked FASTQ files (b), where the number of bytes (B) and number of input reads per block (N) are 64 and 2, respectively. Numbers left of vertical lines indicate byte offsets for FASTQ lines, assuming newline characters (not shown) are one byte. For (b), padding spaces are represented by solid blue rectangles. The first 64 bytes of each file are colored blue and subsequent bytes are colored red. Note that the two ends differ in length; end 1 is 10 bases long and end 2 is 9 bases long. This necessitates differing amounts of padding in the two FASTQ files. After padding, however, corresponding 64-byte blocks from the two files are guaranteed to contain N corresponding reads.
Fig. 3.
Comparison of four lock types and multiprocessing baseline. Reads are unpaired. Results are shown for three aligners (rows) and three systems (columns). Jobs that ran for over 20 min are omitted. Squares indicate the point on each line yielding maximal total alignment throughput. These points are summarized in Table 2.
Fig. 4.
Comparison of three parsing strategies and multiprocessing baseline. Reads are unpaired. Jobs that ran for over 20 min are omitted. Squares indicate the point on each line yielding maximal total alignment throughput. These points are summarized in Table 3.
Fig. 5.
Unpaired-alignment comparison of B-parsing, L-parsing, L-parsing with output striped across 16 files and the MP baseline. BWA-MEM is also evaluated and compared to the Bowtie 2 configurations. Jobs that ran for over 20 min are omitted. Squares indicate the run for each configuration yielding greatest overall alignment throughput. These runs are also summarized in Table 4.
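
The blocked FASTQ layout described in the Fig. 2 caption can be illustrated with a short Python sketch. This is not the authors' converter. The function names `to_blocked_fastq` and `read_block` are hypothetical, and the placement of the padding spaces (just before the final newline of each block) is an assumption made here for illustration; the caption only states that padding spaces bring every group of N reads up to exactly B bytes.

```python
def to_blocked_fastq(records, block_bytes, reads_per_block):
    """Pack (name, seq, qual) records into fixed-size blocks.

    Each group of `reads_per_block` records is padded with spaces to
    exactly `block_bytes` bytes.
    """
    out = bytearray()
    for i in range(0, len(records), reads_per_block):
        group = records[i:i + reads_per_block]
        text = "".join(f"@{n}\n{s}\n+\n{q}\n" for n, s, q in group).encode("ascii")
        if len(text) > block_bytes:
            raise ValueError("reads do not fit in one block; increase block_bytes")
        pad = block_bytes - len(text)
        # Assumption: padding spaces go just before the block's final newline.
        out += text[:-1] + b" " * pad + b"\n"
    return bytes(out)


def read_block(data, block_index, block_bytes):
    """Parse one block, located by byte offset alone."""
    start = block_index * block_bytes
    chunk = data[start:start + block_bytes].decode("ascii")
    # A space is not a valid Phred+33 quality character, so trailing
    # spaces can be stripped without damaging the last quality string.
    lines = chunk.rstrip(" \n").split("\n")
    return [(lines[k][1:], lines[k + 1], lines[k + 3])
            for k in range(0, len(lines), 4)]


# Mirrors the caption's setting: B = 64, N = 2, end 1 is 10 bases, end 2 is 9.
end1 = [(f"r{i}", "A" * 10, "I" * 10) for i in range(1, 5)]
end2 = [(f"r{i}", "C" * 9, "I" * 9) for i in range(1, 5)]
blocked1 = to_blocked_fastq(end1, 64, 2)
blocked2 = to_blocked_fastq(end2, 64, 2)
```

Because every block has the same size, block k of either file starts at byte offset k × B, so a reader can fetch corresponding reads from both ends without parsing any earlier records.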
