Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences

G3 (Bethesda). 2020 Aug 5;10(8):2801-2809. doi: 10.1534/g3.120.401280.


Despite continuous updates of the human reference genome, hundreds of gaps remain unresolved, accounting for about 5% of the total sequence length. Given the availability of whole-genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can now be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 of 783 gaps (16.9%) and added up to 2.2 Mb of novel sequence to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had a high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
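The proportions reported in the abstract can be reproduced from the stated counts. The following is a minimal sketch, not part of the published analysis; the helper name `percent` and the variable names are illustrative assumptions, and only the counts (783, 132, 43) come from the abstract.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts taken from the abstract; variable names are hypothetical.
total_gaps = 783        # gaps considered in the reference genome
closed_gaps = 132       # gaps with at least one gap-closing sequence
polymorphic_gaps = 43   # closed gaps shown to be polymorphic

closed_pct = percent(closed_gaps, total_gaps)
polymorphic_pct = percent(polymorphic_gaps, closed_gaps)

print(f"Closed gaps: {closed_pct}% of {total_gaps}")
print(f"Polymorphic: {polymorphic_pct}% of {closed_gaps} closed gaps")
```

Note that the two percentages use different denominators: the closed-gap rate is relative to all 783 gaps, while the polymorphic rate is relative to the 132 closed gaps only.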

Keywords: de novo assemblies; gap closure; genomic gaps; human genome; non-reference sequences.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Genome, Human*
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • Sequence Analysis, DNA