HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data

J Comput Biol. 2012 Jun;19(6):577-90. doi: 10.1089/cmb.2012.0084.


Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data--known as haplotype assembly--have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms*
  • Alleles
  • Chromosome Mapping / methods*
  • Computational Biology / methods*
  • Genome, Human*
  • Haplotypes
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Polymorphism, Single Nucleotide
  • Sequence Analysis, DNA / methods*