Despite continuous updates of the human reference genome, there are still hundreds of unresolved gaps which account for about 5% of the total sequence length. Given the availability of whole genome de novo assemblies, especially those derived from long-read sequencing data, gap-closing sequences can be determined. By comparing 17 de novo long-read sequencing assemblies with the human reference genome, we identified a total of 1,125 gap-closing sequences for 132 (16.9% of 783) gaps and added up to 2.2 Mb novel sequences to the human reference genome. More than 90% of the non-redundant sequences could be verified by unmapped reads from the Simons Genome Diversity Project dataset. In addition, 15.6% of the non-reference sequences were found in at least one of four non-human primate genomes. We further demonstrated that the non-redundant sequences had high content of simple repeats and satellite sequences. Moreover, 43 (32.6%) of the 132 closed gaps were shown to be polymorphic; such sequences may play an important biological role and can be useful in the investigation of human genetic diversity.
Keywords: de novo assemblies; gap closure; genomic gaps; human genome; non-reference sequences.
Copyright © 2020 Zhao et al.