CoLoRMap: Correcting Long Reads by Mapping short reads

Ehsan Haghshenas; Faraz Hach; S Cenk Sahinalp; Cedric Chauve

doi:10.1093/bioinformatics/btw463

Bioinformatics. 2016 Sep 1;32(17):i545-i551. doi: 10.1093/bioinformatics/btw463.

Authors

Ehsan Haghshenas¹, Faraz Hach², S Cenk Sahinalp³, Cedric Chauve⁴

Affiliations

¹ School of Computing Sciences, MADD-Gen Graduate Program, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.
² School of Computing Sciences; Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada.
³ School of Computing Sciences; Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada; School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.
⁴ Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.

PMID: 27587673
DOI: 10.1093/bioinformatics/btw463

Abstract

Motivation: Second-generation sequencing technologies paved the way for an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. So far, however, long reads have been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

Results: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as those produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: (i) using a classical shortest-path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (ii) extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
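
The shortest-path idea can be illustrated with a small sketch. This is not the CoLoRMap implementation; it is a toy under simplifying assumptions that are not taken from the paper: each short read is given as a hypothetical `(start, sequence)` pair mapped without gaps onto the long read, an edge joins two reads when the second overlaps and extends the first, and the edge weight is the edit distance between the newly added suffix and the long-read bases it covers. The function names `edit_distance` and `best_read_chain` are illustrative only.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def best_read_chain(long_read, reads):
    """Find the chain of overlapping reads with minimum total edit score.

    reads: list of (start, seq) pairs; assumed ungapped mappings (an
    assumption of this sketch). Returns (cost, corrected_segment, chain)
    where chain holds indices into the sorted read list, or None if the
    leftmost read cannot be chained to the read ending furthest right.
    """
    reads = sorted(reads)
    starts = [s for s, _ in reads]
    ends = [s + len(q) for s, q in reads]
    n = len(reads)
    source = 0
    target = max(range(n), key=lambda i: ends[i])

    # Cost of entering the chain: the whole first read against the long read.
    first = edit_distance(reads[source][1], long_read[starts[source]:ends[source]])
    best = {source: first}
    parent = {source: None}
    heap = [(first, source)]

    # Dijkstra's algorithm over the implicit overlap graph.
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        if u == target:
            break
        for v in range(n):
            # v must start inside u and end strictly beyond it.
            if starts[u] <= starts[v] < ends[u] < ends[v]:
                suffix = reads[v][1][ends[u] - starts[v]:]
                w = edit_distance(suffix, long_read[ends[u]:ends[v]])
                if d + w < best.get(v, float("inf")):
                    best[v] = d + w
                    parent[v] = u
                    heapq.heappush(heap, (d + w, v))

    if target not in best:
        return None

    # Walk back from the target to recover the chain.
    chain = []
    node = target
    while node is not None:
        chain.append(node)
        node = parent[node]
    chain.reverse()

    # Splice the reads: first read in full, then each non-overlapping suffix.
    corrected = reads[chain[0]][1]
    for u, v in zip(chain, chain[1:]):
        corrected += reads[v][1][ends[u] - starts[v]:]
    return best[target], corrected, chain


# Toy data: the long read carries two substitution errors; one short read
# ("GAAAAAAA") is a decoy that maps poorly and should be avoided.
noisy_long_read = "ACGTTCGTACCT"
short_reads = [(0, "ACGTAC"), (3, "TACGTA"), (6, "GTACGT"), (2, "GAAAAAAA")]
cost, corrected, chain = best_read_chain(noisy_long_read, short_reads)
```

On this toy input the chain skips the decoy read, because routing through it would add a large edit score, and the spliced short-read sequence replaces the covered part of the long read. A real corrector would also have to handle gapped alignments, uncovered regions and read quality, which this sketch ignores.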

Availability and implementation: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Computational Biology
Genome
Programming Languages
Sequence Alignment
Sequence Analysis, DNA*
Software