Fully-sensitive seed finding in sequence graphs using a hybrid index

Bioinformatics. 2019 Jul 15;35(14):i81-i89. doi: 10.1093/bioinformatics/btz341.

Abstract

Motivation: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods.

Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and a whole human genome graph constructed from variants in the 1000 Genomes Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity.
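
The read-index half of this idea can be illustrated with a toy sketch. It is not the PSI implementation: it assumes a plain hash table of read k-mers in place of a real index, omits the index over selected paths entirely, and assumes every node label is non-empty. All names (Node, Hit, index_reads, extend, find_seeds, example_hits) are hypothetical. Every walk of length k in the graph is spelled out and looked up among the reads, so no branch is pruned, and only k-mers that actually occur in the reads produce hits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy sequence graph node: a non-empty label plus successor node ids.
struct Node {
    std::string label;
    std::vector<int> next;
};

// A seed hit: which read, where in the read, and where the walk starts in the graph.
struct Hit {
    int read;
    std::size_t read_pos;
    int node;
    std::size_t node_pos;
};

// Hash table of read k-mers: an assumed stand-in for an index over the query reads.
using ReadIndex =
    std::unordered_map<std::string, std::vector<std::pair<int, std::size_t>>>;

ReadIndex index_reads(const std::vector<std::string>& reads, std::size_t k) {
    assert(k > 0);
    ReadIndex idx;
    for (std::size_t r = 0; r < reads.size(); ++r)
        for (std::size_t i = 0; i + k <= reads[r].size(); ++i)
            idx[reads[r].substr(i, k)].push_back({static_cast<int>(r), i});
    return idx;
}

// Extend the current walk through `node` from `pos`; once it spells k characters,
// look the k-mer up in the read index. Otherwise branch into every successor.
void extend(const std::vector<Node>& g, const ReadIndex& idx, std::size_t k,
            int node, std::size_t pos, std::string& kmer,
            int start_node, std::size_t start_pos, std::vector<Hit>& hits) {
    const std::string& label = g[node].label;
    std::size_t added = 0;
    while (pos < label.size() && kmer.size() < k) {
        kmer.push_back(label[pos++]);
        ++added;
    }
    if (kmer.size() == k) {
        auto it = idx.find(kmer);
        if (it != idx.end())
            for (const auto& occ : it->second)
                hits.push_back({occ.first, occ.second, start_node, start_pos});
    } else {
        for (int nxt : g[node].next)
            extend(g, idx, k, nxt, 0, kmer, start_node, start_pos, hits);
    }
    kmer.resize(kmer.size() - added);  // backtrack before trying a sibling branch
}

// Report all exact k-mer matches between the reads and any walk in the graph.
std::vector<Hit> find_seeds(const std::vector<Node>& g,
                            const std::vector<std::string>& reads, std::size_t k) {
    ReadIndex idx = index_reads(reads, k);
    std::vector<Hit> hits;
    std::string kmer;
    for (std::size_t v = 0; v < g.size(); ++v)
        for (std::size_t p = 0; p < g[v].label.size(); ++p)
            extend(g, idx, k, static_cast<int>(v), p, kmer,
                   static_cast<int>(v), p, hits);
    return hits;
}

// Hypothetical example: a single-base bubble (G or T) between "AC" and "TA".
std::vector<Hit> example_hits() {
    std::vector<Node> g = {
        {"AC", {1, 2}}, {"G", {3}}, {"T", {3}}, {"TA", {}}};
    std::vector<std::string> reads = {"CGTA", "CTTA"};  // one read per allele
    return find_seeds(g, reads, 3);
}
```

In this example the two reads correspond to the two alleles of the bubble, and the sketch reports seeds on both branches. Because it enumerates every walk of length k, its running time grows with the number of such walks, which is the combinatorial cost that the hybrid design described above is meant to contain.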

Availability and implementation: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.

MeSH terms

  • Algorithms*
  • Alleles
  • Diploidy
  • Genome, Human*
  • Humans
  • Sequence Analysis, DNA
  • Software*