A sensitive repeat identification framework based on short and long reads

Xingyu Liao; Min Li; Kang Hu; Fang-Xiang Wu; Xin Gao; Jianxin Wang

doi:10.1093/nar/gkab563

Nucleic Acids Res. 2021 Sep 27;49(17):e100. doi: 10.1093/nar/gkab563.

Authors

Xingyu Liao¹,², Min Li¹, Kang Hu¹, Fang-Xiang Wu³, Xin Gao², Jianxin Wang¹

Affiliations

¹ Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China.
² Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.
³ Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N5A9, Canada.

Abstract

Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size, because NGS reads are too short to identify long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly improved; and (v) by applying the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
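The general idea of marking repeats with k-mers that match at more than one position can be illustrated with a minimal Python sketch. This is not LongRepMarker's implementation: it uses exact matching within a single sequence instead of alignment between overlap sequences, and the function names `kmer_positions` and `repeat_intervals` and the parameter `min_hits` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each distinct k-mer to the start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def repeat_intervals(seq, k, min_hits=2):
    """Return merged half-open intervals covered by k-mers seen at >= min_hits positions.

    Illustrative assumption: a k-mer that matches exactly at several
    positions stands in for a k-mer that aligns to multiple locations.
    """
    # Collect every start position of every k-mer that occurs often enough.
    covered = sorted(
        start
        for starts in kmer_positions(seq, k).values()
        if len(starts) >= min_hits
        for start in starts
    )
    # Merge overlapping or touching k-mer windows into candidate repeat regions.
    intervals = []
    for start in covered:
        end = start + k
        if intervals and start <= intervals[-1][1]:
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([start, end])
    return [tuple(iv) for iv in intervals]
```

For the toy sequence `GATTACACCGTGATTACA` with `k=5`, `repeat_intervals` returns `[(0, 7), (11, 18)]`, the two copies of `GATTACA`. Error-prone long reads would need tolerant alignment rather than exact matching, which this sketch deliberately leaves out.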

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Animals
Base Sequence
Chromosome Mapping / methods
Computational Biology / methods*
Computer Simulation
Databases, Genetic
Genome / genetics*
High-Throughput Nucleotide Sequencing / methods*
Humans
Internet
Repetitive Sequences, Nucleic Acid / genetics*
Reproducibility of Results
Sequence Alignment / methods
Sequence Analysis, DNA / methods*