CoLoRMap: Correcting Long Reads by Mapping short reads

Bioinformatics. 2016 Sep 1;32(17):i545-i551. doi: 10.1093/bioinformatics/btw463.

Abstract

Motivation: Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. So far, however, long reads have been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

Results: We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: (i) using a classical shortest-path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and (ii) extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.
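
The shortest-path idea can be illustrated with a minimal Python sketch. This is not the CoLoRMap implementation: the `Alignment` record, the `overlaps` test, and in particular the edge weight (the edit distance of the next read, scaled by the fraction of that read extending past the previous one) are simplifying assumptions chosen for illustration. Each short-read alignment is a node, two nodes are joined when the reads overlap on the long read, and Dijkstra's algorithm finds the chain of overlapping reads with the smallest total edit score.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical record for one short read mapped onto a long read."""
    start: int  # first long-read position covered (inclusive)
    end: int    # end of the covered interval (exclusive)
    edits: int  # edit distance between the short read and that interval


def overlaps(u: Alignment, v: Alignment) -> bool:
    # v starts inside u and extends past it, so v can follow u in a chain.
    return u.start < v.start < u.end < v.end


def min_edit_chain(alns: List[Alignment], source: int,
                   target: int) -> Tuple[float, List[int]]:
    """Dijkstra over the overlap graph; returns (cost, chain of indices).

    Assumed edge weight u -> v: the edits of v, scaled by the fraction of v
    that lies beyond the end of u. Returns (inf, []) if no chain exists.
    """
    inf = float("inf")
    dist = {source: float(alns[source].edits)}
    prev = {}
    heap = [(dist[source], source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, inf):
            continue  # stale heap entry
        if u == target:
            break
        for v, b in enumerate(alns):
            if not overlaps(alns[u], b):
                continue
            weight = b.edits * (b.end - alns[u].end) / (b.end - b.start)
            nd = d + weight
            if nd < dist.get(v, inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return inf, []
    # Walk predecessors back from the target to recover the chain.
    chain = [target]
    while chain[-1] != source:
        chain.append(prev[chain[-1]])
    return dist[target], chain[::-1]


if __name__ == "__main__":
    reads = [
        Alignment(0, 100, 2),
        Alignment(50, 150, 10),  # overlaps the first read but is noisy
        Alignment(60, 160, 1),   # cleaner alternative
        Alignment(140, 240, 3),
    ]
    cost, chain = min_edit_chain(reads, source=0, target=3)
    print(chain, cost)
```

In this toy input the chain goes through the read with one edit rather than the read with ten, which is the behavior the minimization is meant to produce. The sketch scans all alignments for neighbors, which is quadratic; sorting alignments by start position would be the natural refinement for real data.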

Availability and implementation: The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Algorithms*
  • Computational Biology
  • Genome
  • Programming Languages
  • Sequence Alignment
  • Sequence Analysis, DNA*
  • Software