Bioinformatics. 2019 Jul 15;35(14):i200-i207.
doi: 10.1093/bioinformatics/btz376.

TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain

Yan Gao et al. Bioinformatics. 2019.

Abstract

Motivation: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long-reads up to tens of kilobases in length, but with high error rates. To reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. The linear products of RCA contain multiple tandem copies of the template molecule. With additional in silico processing, these tandem copies can be collapsed into a consensus sequence with higher accuracy than the original raw reads. Existing pipelines, which use alignment-based methods to discover tandem repeat patterns in long-reads, are either inefficient or lack sensitivity.

Results: We present a novel tandem repeat detection and consensus calling tool, TideHunter, to efficiently discover tandem repeat patterns and generate high-quality consensus sequences from amplified, tandemly repeated long-read sequencing data. TideHunter works with noisy long-reads (PacBio and ONT) at error rates of up to 20% and places no limit on the maximum repeat pattern size. We benchmarked TideHunter using simulated and real datasets with varying error rates and repeat pattern sizes. TideHunter is tens of times faster than state-of-the-art methods and has higher sensitivity and accuracy.

Availability and implementation: TideHunter is written in C, is open source, and is available at https://github.com/yangao07/TideHunter.

Figures

Fig. 1.
Chaining of tandem repeat anchors. The three arrows represent three copies of a template sequence. Each vertical line represents the seed of a k-mer, and seeds drawn at the same height are identical k-mers. Each horizontal line represents a tandem repeat hit between identical k-mers. Solid and dashed lines mark hits whose distances are likely and unlikely, respectively, to be the true repeat pattern size. After dynamic programming, the optimal chain is expected to consist of anchors whose hit distances are close to the repeat pattern size.
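
The idea in this caption can be illustrated with a minimal Python sketch. It is not TideHunter's implementation: the function names (`tandem_hits`, `chain_hits`, `estimate_period`), the `max_drift` tolerance, and the unit-score chaining rule are all assumptions made for illustration. Each hit links a k-mer to its previous identical occurrence, and a toy quadratic-time dynamic program keeps the longest chain of hits whose distances agree.

```python
from collections import Counter

def tandem_hits(seq, k):
    """Link every k-mer to its previous identical occurrence as (start, end)."""
    last_seen = {}
    hits = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            hits.append((last_seen[kmer], i))
        last_seen[kmer] = i
    return hits  # ordered by strictly increasing end position

def chain_hits(hits, max_drift=2):
    """Toy O(n^2) chaining DP (hypothetical scoring, one point per hit).

    A hit may follow an earlier hit if it starts later and its hit
    distance differs by at most max_drift.
    """
    if not hits:
        return []
    best = [1] * len(hits)
    prev = [-1] * len(hits)
    for i, (si, ei) in enumerate(hits):
        for j in range(i):
            sj, ej = hits[j]
            compatible = sj < si and abs((ei - si) - (ej - sj)) <= max_drift
            if compatible and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Trace back from the highest-scoring hit.
    i = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]

def estimate_period(chain):
    """Most common hit distance in the chain, or None for an empty chain."""
    if not chain:
        return None
    return Counter(e - s for s, e in chain).most_common(1)[0][0]
```

For an error-free read made of three copies of a 10-base unit, every hit in the chain has distance 10, so the estimated period is 10. On noisy reads, spurious hits with unrelated distances (the dashed lines in the figure) would be expected to fall outside the best chain.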
Fig. 2.
Searching for the repeat unit boundary based on global alignment. s and e are the current repeat boundaries. Two anchors, A1(s1,e1) and A2(s2,e2), are selected because their starting positions are the closest to e. The two subsequences from s1 to s2 and from e1 to e2 are extracted and aligned end to end with a global alignment. The next repeat unit boundary can then be calculated from the alignment result. In this example, the base G at e is matched with a G in subsequence [e1:e2], whose coordinate is then taken as the putative next boundary.
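
The boundary search described in this caption can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the tool's code: it uses a plain Needleman-Wunsch alignment with hypothetical unit scores (`match=1`, `mismatch=-1`, `gap=-1`), it assumes the two anchors flank the current boundary (s1 <= e < s2), and the names `global_align_map` and `next_boundary` are invented here.

```python
def global_align_map(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment of a and b.

    Returns a dict mapping each index of a that is aligned to a base of b
    (match or mismatch column) to the corresponding index of b.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner, preferring diagonal moves.
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping

def next_boundary(seq, e, anchor1, anchor2):
    """Project boundary e through the alignment of [s1:s2] against [e1:e2].

    Returns the projected coordinate, or None if e falls in a gap.
    """
    (s1, e1), (s2, e2) = anchor1, anchor2
    mapping = global_align_map(seq[s1:s2], seq[e1:e2])
    j = mapping.get(e - s1)
    return None if j is None else e1 + j
```

As a usage example, take the read `"ACGTTGCAGT" + "ACGTTAGCAGT"`, where the second copy carries an inserted A. With anchors (0, 10) and (6, 17), a boundary at position 5 projects to position 16, a shift of 11 rather than 10, because the alignment absorbs the insertion.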
Fig. 3.
Dynamic programming matrix of sequence-to-graph alignment and the three types of operations. SIMD parallelization applies to the match and deletion operations because they depend only on previous rows. The insertion operation must be processed sequentially because it depends on the left cell, which is in the same row.
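
The dependency structure in this caption can be made concrete with a small Python sketch. It is a simplified illustration, not the paper's implementation: it assumes a linear gap penalty, hypothetical unit scores, one base per graph node, nodes given in topological order, and a global alignment that ends at a sink node. No real SIMD is used; pass 1 is the part whose cells are mutually independent and could be vectorized, while pass 2 is the left-to-right sweep.

```python
def align_to_graph(bases, preds, query, match=1, mismatch=-1, gap=-1):
    """Global alignment score of query against a DAG of single-base nodes.

    bases[i] is the base of node i (nodes in topological order) and
    preds[i] lists its predecessor node indices (empty for a source node).
    """
    m = len(query)
    rows = [[j * gap for j in range(m + 1)]]  # row 0: virtual start node
    for base, plist in zip(bases, preds):
        prev_rows = [rows[p + 1] for p in plist] or [rows[0]]
        # Pass 1: match and deletion read only predecessor rows, so every
        # column j is independent of the others (the SIMD-friendly part).
        row = []
        for j in range(m + 1):
            best = max(r[j] + gap for r in prev_rows)  # deletion
            if j > 0:
                sub = match if base == query[j - 1] else mismatch
                best = max(best, max(r[j - 1] + sub for r in prev_rows))
            row.append(best)
        # Pass 2: insertion reads the left cell of the same row, so the
        # columns must be visited in order.
        for j in range(1, m + 1):
            row[j] = max(row[j], row[j - 1] + gap)
        rows.append(row)
    has_successor = {p for plist in preds for p in plist}
    sinks = [i for i in range(len(bases)) if i not in has_successor]
    return max(rows[i + 1][m] for i in sinks)
```

For example, a graph A -> {C, G} -> T can be written as `bases = "ACGT"` with `preds = [[], [0], [0], [1, 2]]`. The query `"AGT"` follows the G branch exactly, while `"AT"` needs one deletion.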
