BMC Bioinformatics. 2009 Dec 15;10:421.
doi: 10.1186/1471-2105-10-421.

BLAST+: Architecture and Applications

Free PMC article

Christiam Camacho et al. BMC Bioinformatics. 2009.

Abstract

Background: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user interface of the current command-line applications.

Results: We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site.

Conclusion: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.
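
The query splitting mentioned in the Results can be illustrated with a minimal sketch. It assumes fixed-size chunks with a fixed overlap, so that an alignment spanning a chunk boundary is still fully contained in at least one chunk. The function names, the chunk size, and the overlap are hypothetical and are not taken from the BLAST+ source; the abstract does not state how BLAST+ chooses them.

```python
from typing import Iterator, Tuple


def split_query(seq: str, chunk_size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs that cover seq with overlapping windows.

    chunk_size and overlap are illustrative parameters (assumptions),
    not values used by BLAST+.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_size]
        # Stop once the current window reaches the end of the query.
        if start + chunk_size >= len(seq):
            break
        start += step


def to_query_coords(offset: int, chunk_start: int, chunk_end: int) -> Tuple[int, int]:
    """Map an alignment range found within a chunk back to full-query coordinates."""
    return offset + chunk_start, offset + chunk_end


if __name__ == "__main__":
    for offset, chunk in split_query("ACGTACGTAC", chunk_size=4, overlap=1):
        print(offset, chunk)
```

Each chunk can then be searched on its own, with hits shifted back to the original query coordinates by the chunk offset.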

Figures

Figure 1
Schematic of a BLAST search. The first phase is "setup". The query is read, low-complexity or other filtering might be applied to the query, and a "lookup" table is built. The next phase is "scanning". Each subject sequence is scanned for words ("hits") matching those in the lookup table. These hits are further processed, extended by gap-free and gapped alignments, and scored. Significant "preliminary" matches are saved for further processing. The final phase in the BLAST algorithm, called the "trace-back", finds the locations of insertions and deletions for alignments saved in the scanning phase.
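
The setup and scanning phases in the Figure 1 caption can be sketched as a toy example. It covers only exact-word seeding and gap-free extension; it omits filtering, gapped extension, statistics, and the trace-back. The function names, the scoring values (`match`, `mismatch`), and the `x_drop` cutoff are assumptions chosen for illustration, not BLAST+ defaults or BLAST+ code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_lookup(query: str, word_size: int) -> Dict[str, List[int]]:
    """Setup phase: map every word in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return dict(table)


def scan(subject: str, table: Dict[str, List[int]], word_size: int) -> List[Tuple[int, int]]:
    """Scanning phase: report (query_pos, subject_pos) for each word hit."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query: str, subject: str, q_pos: int, s_pos: int,
                    word_size: int, match: int = 1, mismatch: int = -3,
                    x_drop: int = 5) -> Tuple[int, int, int]:
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    Extension in a direction stops once the running score falls more
    than x_drop below the best score seen in that direction.
    """
    def walk(i: int, j: int, step: int) -> Tuple[int, int]:
        best = run = 0
        best_len = length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            run += match if query[i] == subject[j] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break
            i += step
            j += step
        return best, best_len

    right_score, right_len = walk(q_pos + word_size, s_pos + word_size, 1)
    left_score, left_len = walk(q_pos - 1, s_pos - 1, -1)
    score = word_size * match + left_score + right_score
    return score, q_pos - left_len, q_pos + word_size + right_len


if __name__ == "__main__":
    query, subject, w = "GATTACAGG", "TTGATTACAGGTT", 4
    table = build_lookup(query, w)
    hits = scan(subject, table, w)
    q_pos, s_pos = hits[0]
    print(hits)
    print(extend_ungapped(query, subject, q_pos, s_pos, w))
```

Building the table once per query and then streaming every subject sequence past it is what makes the scanning phase cheap relative to exact alignment of every pair.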
Figure 2
Speedup of BLASTX searches for differently sized queries with and without query splitting. Different sized pieces of [Genbank:NC_007113.2] were searched against a set of human proteins. The query length in kbases is on the x-axis, with a log scale. On the y-axis is the fractional speedup, which is defined as (T_baseline / T_blastx) - 1. Three searches were performed with both the baseline and the blastx applications (for each data point), and the lowest time for each application was used.
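
The fractional speedup defined in the Figure 2 caption can be computed as in the short sketch below, which takes the lowest of the repeated run times for each application, as the caption describes. The function name and the timing values in the usage lines are made up for illustration and are not measurements from the paper.

```python
from typing import Sequence


def fractional_speedup(baseline_times: Sequence[float],
                       new_times: Sequence[float]) -> float:
    """Return (T_baseline / T_new) - 1 using the lowest time of each set of runs."""
    if not baseline_times or not new_times:
        raise ValueError("each application needs at least one timing")
    t_baseline = min(baseline_times)
    t_new = min(new_times)
    if t_new <= 0:
        raise ValueError("run times must be positive")
    return t_baseline / t_new - 1.0


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds for three runs of each application.
    print(fractional_speedup([12.0, 10.0, 11.0], [5.0, 6.0, 5.5]))
```

A value of 0 means the two applications took the same time; a value of 1 means the new application ran in half the baseline time.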
Figure 3
L2 data cache misses for BLASTX searches with and without query splitting. Cache misses were measured by Cachegrind [24] and only misses reading from the cache are shown. On the x-axis are different query lengths in kbases. The number of L2 cache misses is shown on the y-axis. The top line is for the baseline application without query splitting, the bottom line is for the blastx application. The queries are different sized pieces of [Genbank:NC_007113.2] searched against the set of human proteins used for Figure 2.
Figure 4
Scatter plot of MEGABLAST search times with and without partial retrieval. 163 human ESTs from UniGene cluster 235935 were searched against all human chromosomes [22]. On the x-axis are times for the baseline application; on the y-axis are times for the new blastn application. Sequences with the best improvement are those furthest to the right, and they also matched the largest number of subject sequences. A word size of 24 was used for the runs as well as database masking with RepeatMasker. Three searches were done with both the baseline and blastn application for each data point, and the lowest time for each application was used.

References

    1. Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410.
    2. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389.
    3. NCBI C toolkit. http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML
    4. Zhang Z, Schäffer A, Miller W, Madden T, Lipman D, Koonin E, Altschul S. Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 1998;26(17):3986–3990. doi: 10.1093/nar/26.17.3986.
    5. Schäffer A, Wolf Y, Ponting C, Koonin E, Aravind L, Altschul S. IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 1999;15(12):1000–1011. doi: 10.1093/bioinformatics/15.12.1000.