Nat Commun, 10 (1), 1025

Genome Maps Across 26 Human Populations Reveal Population-Specific Patterns of Structural Variation

Michal Levy-Sakin et al. Nat Commun.

Abstract

Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome.

Conflict of interest statement

E.T.L., A.R.H., A.N., W.-P.W., and H.C. are employees of Bionano Genomics. The remaining authors declare no competing interests.

Figures

Fig. 1
Distribution and properties of large structural variations in the human genome. a Ideogram depicting the distribution of indels and complex regions. The gray histogram above the chromosomes depicts the number of indels detected in all populations using a sliding window of 1 Mb with a 10-kb step size. Chromosome fill shows the different regions classified in the genome. Red, structurally complex regions; blue, low individual assembly coverage; black, regions with long sequence- or nick-based gaps in the reference. For display purposes, both low-coverage and gap regions were only displayed if they were longer than 500 kb. Black boxes enclose areas shown in more detail in b–d. b Alignment of individual sample assemblies to the reference at chr2: 148.4–149 Mb. This region contains a low level of structural variation on the basis of both indel calls and the consensus assembly. c Alignment of individual sample assemblies to the reference at chr3: 38.4–39 Mb. This region contains a high density of indel calls but a low level of complex structural variation, and was categorized as low complexity. d Alignment of individual sample assemblies to the reference at chr9: 42–42.9 Mb. This pericentromeric region has complex structural variation as well as many indels. In b–d, the red bar at top is the reference nick pattern and the bars below are consensus maps from 10 random samples. Yellow segments are highly similar to the reference, while green or red segments are shorter or longer than the corresponding reference segments, respectively, where intensity of color correlates positively with difference from the reference. Red lines represent aligned labels and black lines represent unaligned labels. Circles denote inversions while squares at the ends of contigs denote translocation break ends
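The sliding-window histogram in panel a (1-Mb windows, 10-kb step) can be illustrated with a minimal sketch. The function name, the toy positions, and the choice to truncate windows at the chromosome end are assumptions made for illustration; the source does not state how the authors handled chromosome ends or how indel positions were defined.

```python
from bisect import bisect_left

def sliding_window_counts(positions, chrom_length, window=1_000_000, step=10_000):
    """Count events in each half-open window [start, end) along one chromosome.

    Windows start every `step` bp; the last windows are truncated at the
    chromosome end (an assumption, not taken from the source).
    """
    pts = sorted(positions)
    counts = []
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        # Number of positions p with start <= p < end, via binary search.
        n = bisect_left(pts, end) - bisect_left(pts, start)
        counts.append((start, end, n))
        start += step
    return counts

# Hypothetical indel positions (bp) on a toy 1.03-Mb chromosome.
toy_positions = [5_000, 15_000, 1_005_000, 1_020_000]
windows = sliding_window_counts(toy_positions, chrom_length=1_030_000)
```

Because consecutive windows overlap by 990 kb, each indel contributes to up to 100 windows, which is what smooths the histogram.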
Fig. 2
Comparisons between the large indels identified by optical mapping (OM) in this study and the ones identified in other large-scale studies. a Number of overlapping and unique large indels identified in our study and Sudmant et al. based on the 144 samples commonly studied in the two studies (1KGP). Ins and Del correspond to insertions and deletions, respectively. Tot is the total number of indels in each category. Numbers in red, blue, and purple correspond to numbers of indels identified by OM only, by 1KGP only, and by both, respectively. Since one OMSV may overlap multiple 1KGP SVs, and vice versa, there are actually two sets of numbers in the intersection between the OM set and the 1KGP set. To keep the Venn diagram simple, we have only shown the numbers of indels in the OM set that overlap indels in the 1KGP set. This intersection also contains 406 insertions and 4473 deletions (4879 indels in total) in the 1KGP set that overlap indels in the OM set. b, c Distributions of allele frequencies of the indels uniquely identified by optical mapping (red) and commonly identified in this study and Sudmant et al. based on the 144 common samples (purple), considering only insertions (b) and only deletions (c). Since each sample has at most two SV alleles, for an SV with an allele count of x, the number of samples that support this SV is between ⌈x/2⌉ and x. d Number of large deletions identified by optical mapping that are also identified in any samples in four other studies (see also Supplementary Fig. 27)
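The relation between allele count and supporting samples in panels b and c follows from simple counting: if every carrier is homozygous, an allele count of x needs only ⌈x/2⌉ samples, and if every carrier is heterozygous it needs x. A minimal sketch, with a hypothetical function name:

```python
def supporting_sample_bounds(allele_count):
    """Return (min, max) number of samples that can carry `allele_count` alleles
    when each sample has at most two SV alleles."""
    if allele_count < 0:
        raise ValueError("allele_count must be non-negative")
    # Ceiling of allele_count / 2 using integer arithmetic.
    fewest = (allele_count + 1) // 2
    most = allele_count
    return fewest, most

bounds_for_five = supporting_sample_bounds(5)
bounds_for_four = supporting_sample_bounds(4)
```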
Fig. 3
Population structure of large indels at three different levels. a Super-population level: the average ratio of indels identified from samples in each super-population that are specific to that super-population, shared with some other super-populations but not all, or shared with all other super-populations. Random sub-sampling has been applied to balance the sizes of super-populations. The reported values are the average of 100 random sub-samples. b Population level: a phylogenetic tree constructed based on the indel occurrence matrix. c Single-sample level: the first two principal components of the indel occurrence matrix based on super-population groups. AFR Africans, AMR Americans, EAS East Asians, EUR Europeans, SAS South Asians
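The balanced sub-sampling described for panel a can be sketched as follows. This is one possible reading, not the authors' procedure: it assumes the ratios are computed on the union of indels across the sampled individuals of each super-population, and the data layout, function name, and category labels are hypothetical.

```python
import random

def sharing_fractions(calls, n_iter=100, seed=0):
    """calls: {super_population: {sample_id: set of indel IDs}}.

    Returns, per super-population, the average fraction of its indels that are
    specific to it, shared with some (not all) others, or shared with all others.
    """
    rng = random.Random(seed)
    # Balance group sizes by drawing as many samples as the smallest group has.
    size = min(len(samples) for samples in calls.values())
    labels = ("specific", "some", "all")
    totals = {pop: dict.fromkeys(labels, 0.0) for pop in calls}
    for _ in range(n_iter):
        unions = {}
        for pop, samples in calls.items():
            chosen = rng.sample(sorted(samples), size)
            unions[pop] = set().union(*(samples[s] for s in chosen))
        for pop, indels in unions.items():
            others = [u for p, u in unions.items() if p != pop]
            counts = dict.fromkeys(labels, 0)
            for indel in indels:
                k = sum(indel in u for u in others)
                if k == 0:
                    counts["specific"] += 1
                elif k == len(others):
                    counts["all"] += 1
                else:
                    counts["some"] += 1
            for key in labels:
                totals[pop][key] += counts[key] / len(indels) if indels else 0.0
    return {pop: {k: v / n_iter for k, v in t.items()} for pop, t in totals.items()}

# Toy input with one sample per hypothetical group, so sub-sampling is deterministic.
toy_calls = {
    "A": {"a1": {1, 2, 5, 6}},
    "B": {"b1": {2, 3, 5}},
    "C": {"c1": {2, 4, 6}},
}
fractions = sharing_fractions(toy_calls)
```

In the toy input, indel 1 is specific to group A, indel 2 is shared by all groups, and indels 5 and 6 are each shared with only one other group.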
Fig. 4
Analysis of copy number variation (CNV) in the pepsinogen A cluster at chromosome 11q12.2 using multiple alignment. a Visualization of CNV haplotypes using multiple alignment of contigs allows easy counting of the repeating units. Copy numbers (CNs) are shown on the left. Alignment of all individuals across the 26 populations is provided in Supplementary Fig. 13. b A boxplot of CN in the different super-populations. East Asians (EAS) have significantly more copies than other populations (p < 0.05, Tukey test), while Europeans (EUR) have significantly fewer copies than other populations (p < 0.05, Tukey test) except Americans (AMR)
Fig. 5
Characterization and extension of chromosome 21p11.2 by multiple alignment. a Multiple alignment of selected contigs at 21p11.2. Red and blue arrows below indicate the subregions with patterns absent and present in chr21 of hg38, respectively. b The genome structures of hg38 (left) and the proposed new structure based on multiple alignment (right). The genome structure is represented as a flow chart of signal patterns where red and blue nodes represent patterns absent and present in chr21 of hg38, respectively
Fig. 6
Chromosome- and population-dependent distribution of paralogy block 3 in subtelomeric regions. Previously described subtelomeric repeat element paralogy blocks are shown as colored rectangles above each chromosome arm. Yellow rows depict consensus contigs beneath blue bars representing hg38 references. Dark green dashes indicate Nt.BspQI nick sites matching the reference while lighter green dashes represent unmatched nick sites. Teal arrows mark regions not in either reference. Additional paralogy blocks are also shown above these extended regions, with dashed boxes indicating regions matching specific blocks. Paralogy block 3, shown by the dashed box in purple, is found on only one haplotype of 6p, on all haplotypes of 15q, and on a less common haplotype of 7p
Fig. 7
Acrocentric chromosome patterns in non-aligned maps. Non-aligned maps were grouped into unique patterns. These groups were analyzed for patterns localizing maps to near telomeric regions. a An acrocentric map (blue) supported by molecules (yellow) with Nt.BspQI labels (green dots) aligned to an in silico labeled map (dark blue vertical lines). Molecules comprising the maps were localized to acrocentric regions by a CRISPR-Cas9 labeling method described previously. The green T indicates telomere labeling. The 27-kb homolog, shown by the blue teeth underneath the chromosome 4p map, was identified by in silico nicking of the chromosome 4p sequence indicated by red bars throughout the figure. b Seven unique acrocentric maps (yellow bars with green BspQI labels) were identified and aligned to a summarized version (blue bar). The red-, green-, and blue-colored bars indicate the elements defining an acrocentric map. c A general model for an expected acrocentric map. From the telomeric end (green T), the 27-kb homolog must be present (red bar), followed by tandem repeat region 1 (TR1), a unique labeling pattern (green bar, Pattern 1), a second unique labeling pattern (blue bar, Pattern 2), and a final tandem repeat region (TR2). Patterns 1 and 2 exhibit little variation, while unit counts of the tandem repeat regions varied considerably
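In silico nicking, as mentioned for the chromosome 4p sequence, amounts to locating the nicking enzyme's recognition motif on both strands of a sequence. The sketch below assumes the Nt.BspQI recognition sequence GCTCTTC and reports motif start coordinates rather than exact nick positions; the function name and the toy sequence are hypothetical, and this is not the authors' pipeline.

```python
def find_label_sites(seq, motif="GCTCTTC"):
    """Return sorted 0-based start positions of `motif` on either strand of `seq`."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = motif.translate(complement)[::-1]
    seq = seq.upper()
    sites = set()
    for pattern in {motif, reverse_complement}:
        i = seq.find(pattern)
        while i != -1:
            sites.add(i)
            # Continue one base further so overlapping matches are not missed.
            i = seq.find(pattern, i + 1)
    return sorted(sites)

# Toy sequence with one forward-strand and one reverse-strand motif.
toy_sites = find_label_sites("AAGCTCTTCAAAGAAGAGCTT")
```

The distances between consecutive sites give the expected label spacing pattern against which measured optical maps can be compared.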
