Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2016 Oct 5;11(10):e0163962.
doi: 10.1371/journal.pone.0163962. eCollection 2016.

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Affiliations

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Wei Shen et al. PLoS One. .

Abstract

FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly. This paper describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OSX, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations. SeqKit is open source and available on Github at https://github.com/shenwei356/seqkit.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Performance comparison for FASTA/Q file parsing.
Dataset A consists of 67,748 DNA sequences with average length of 41 Kb; dataset B is the human genome with 24 chromosomes, one mitochondrial sequence and 169 scaffolds and dataset C contains 9,186,045 Illumina SE reads. All tests were repeated five times, and the average time or memory usage was computed. See supplementary data for details of test data and commands.
Fig 2
Fig 2. Performance comparison on five manipulations of FASTA file.
Dataset A consists of 67,748 DNA sequences with average length of 41 Kb and dataset B is the human genome with 24 chromosomes, one mitochondrial sequence and 169 scaffolds. All tests were repeated three times, and the average time or memory usage was computed. See supplementary data for details of test data and commands.
Fig 3
Fig 3. Performance of SeqKit on different data sizes.
The text label represents file size relative to the human genome chromosome 1 (248,956,422 bp, file size: 241.4 Mb). All tests were repeated three times, and the average time or memory usage was computed. See supplementary data for details of test data and commands.

Similar articles

Cited by

References

    1. Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435–41. 10.1126/science.2983426 . - DOI - PubMed
    1. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research. 2010;38(6):1767–71. 10.1093/nar/gkp1137 - DOI - PMC - PubMed
    1. Hester J. A collection of scripts developed to interact with fasta, fastq and sam/bam files. Available from: https://github.com/jimhester/fasta_utilities.
    1. FASTX-Toolkit, FASTQ/A short-reads pre-processing tools. Available from: http://hannonlab.cshl.edu/fastx_toolkit/.
    1. Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. Efficient "pythonic" access to FASTA files using pyfaidx. PeerJ Preprints. 2015;3:e1196.

Grants and funding

This work was supported by the National Natural Science Foundation of China (grant No. 81373133 to YL and No. 31570173 to FH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.