Phased Genome Assemblies

Methods Mol Biol. 2023;2590:273-286. doi: 10.1007/978-1-0716-2819-5_16.

Abstract

The ultimate goal of de novo assembly of reads sequenced from a diploid individual is the separate reconstruction of the sequences corresponding to the two copies of each chromosome. Unfortunately, the allele linkage information needed to perform phased genome assemblies has been difficult to generate. Hence, most current genome assemblies are a haploid mixture of the two underlying chromosome copies present in the sequenced individual. Sequencing technologies providing long (20 kb) and accurate reads are the basis for generating phased genome assemblies. This chapter provides a brief overview of the main milestones in traditional genome assembly, focusing on the bioinformatic techniques developed to generate haplotype information from different specialized protocols. With these techniques as background, the chapter reviews the current algorithms for generating phased assemblies from long reads with low error rates. Current techniques perform haplotype-aware error correction steps to increase the quality of the raw reads. In addition, variations on the traditional overlap-layout-consensus (OLC) graph have been developed in an effort to eliminate edges between reads sequenced from different chromosome copies. This allows large presence-absence variants between the chromosome copies to be taken into account. The development of these algorithms, along with improved sequencing technologies, has been crucial for finishing chromosome-level assemblies of complex genomes.
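To illustrate removing overlap-graph edges between reads from different chromosome copies, here is a minimal Python sketch. It is not the method of any particular assembler covered in the chapter. It assumes each read has already been reduced to the alleles it carries at known heterozygous positions, and that candidate overlaps are given. The names `shared_het_mismatches`, `haplotype_aware_edges`, and `max_mismatches` are hypothetical, as is the toy data.

```python
def shared_het_mismatches(read_a, read_b, het_sites):
    """Count heterozygous sites covered by both reads where the alleles differ."""
    mismatches = 0
    for pos in het_sites:
        a = read_a["alleles"].get(pos)
        b = read_b["alleles"].get(pos)
        # Only sites observed in both reads can support or contradict an overlap.
        if a is not None and b is not None and a != b:
            mismatches += 1
    return mismatches


def haplotype_aware_edges(reads, overlaps, het_sites, max_mismatches=0):
    """Keep only overlap edges whose reads agree at shared heterozygous sites.

    Assumption: an edge with more than `max_mismatches` disagreements is
    treated as joining reads from different chromosome copies and is dropped.
    """
    kept = []
    for a, b in overlaps:
        if shared_het_mismatches(reads[a], reads[b], het_sites) <= max_mismatches:
            kept.append((a, b))
    return kept


# Toy data (hypothetical): r1 and r2 come from one copy, r3 and r4 from the other.
reads = {
    "r1": {"alleles": {100: "A", 250: "G"}},
    "r2": {"alleles": {250: "G", 400: "T"}},
    "r3": {"alleles": {100: "C", 250: "T"}},
    "r4": {"alleles": {250: "T", 400: "C"}},
}
overlaps = [("r1", "r2"), ("r1", "r3"), ("r3", "r4"), ("r2", "r4")]
het_sites = [100, 250, 400]

kept = haplotype_aware_edges(reads, overlaps, het_sites)
print(kept)
```

In this toy case only the two within-copy overlaps survive, so the layout step would trace each chromosome copy separately. With real reads, sequencing errors at heterozygous sites are one reason a nonzero `max_mismatches` or a prior error-correction step might be needed.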

Keywords: Algorithms; Genomics; Phased assembly; Ploidy; Sequencing.

MeSH terms

  • Algorithms*
  • Alleles
  • Computational Biology*
  • Haplotypes
  • High-Throughput Nucleotide Sequencing / methods
  • Sequence Analysis, DNA / methods