RecoverY: k-mer-based read classification for Y-chromosome-specific sequencing and assembly

Bioinformatics. 2018 Apr 1;34(7):1125-1131. doi: 10.1093/bioinformatics/btx771.

Abstract

Motivation: The mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and the low sequencing depth that results from its haploid nature. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. As the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, the strategy required the manual setting of key parameters, a time-consuming process that led to sub-optimal assemblies.

Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. This algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.
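The general idea of abundance-based read selection can be illustrated with a minimal Python sketch. This is not the RecoverY implementation: the function names, the quantile rule used to pick the threshold, and the `min_fraction` cutoff are all assumptions made for illustration, and the abstract does not specify how the tool actually derives its threshold. The sketch counts k-mers in an enriched read set, uses k-mers from trusted Y sequences (for example, a related species' Y or known Y transcripts) to choose an abundance threshold, and keeps reads whose k-mers mostly reach that threshold.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def choose_threshold(counts, reference_seqs, k, quantile=0.05):
    """Pick an abundance threshold from k-mers of trusted Y sequences.

    Assumed rule (not taken from the paper): use a low quantile of the
    abundances of reference k-mers that also occur in the reads.
    """
    ref_kmers = {km for seq in reference_seqs for km in kmers(seq, k)}
    abundances = sorted(counts[km] for km in ref_kmers if km in counts)
    if not abundances:
        raise ValueError("no reference k-mers found in the reads")
    return abundances[int(quantile * (len(abundances) - 1))]


def select_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which enough k-mers reach the abundance threshold.

    min_fraction is a hypothetical parameter chosen for this sketch.
    """
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k
        hits = sum(1 for km in kms if counts[km] >= threshold)
        if hits / len(kms) >= min_fraction:
            kept.append(read)
    return kept


if __name__ == "__main__":
    # Toy data: an "enriched" Y-like read seen five times, one stray read once.
    y_read = "ACGTACGGTCA"
    other_read = "TTTTGGGGCCCA"
    reads = [y_read] * 5 + [other_read]
    counts = count_kmers(reads, k=4)
    threshold = choose_threshold(counts, [y_read], k=4)
    print(threshold, len(select_reads(reads, counts, 4, threshold)))
```

In practice, k-mer counting on real sequencing data would be delegated to a dedicated counting tool and would typically treat a k-mer and its reverse complement as equivalent; both are omitted here to keep the sketch short.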

Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY.

Contact: kmakova@bx.psu.edu or pashadag@cse.psu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Animals
  • Chromosomes, Mammalian
  • Genomics / methods
  • Gorilla gorilla / genetics
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Male
  • Mammals
  • Sequence Analysis, DNA / methods*
  • Software*
  • Y Chromosome*