Genome Res. 2019 Dec;29(12):2020-2033.
doi: 10.1101/gr.250092.119. Epub 2019 Nov 6.

Network-based Hierarchical Population Structure Analysis for Large Genomic Data Sets

Free PMC article

Gili Greenbaum et al. Genome Res. 2019.

Abstract

Analysis of population structure in natural populations using genetic data is a common practice in ecological and evolutionary studies. With large genomic data sets of populations now appearing more frequently across the taxonomic spectrum, it is becoming increasingly possible to reveal many hierarchical levels of structure, including fine-scale genetic clusters. To analyze these data sets, methods need to be appropriately suited to the challenges of extracting multilevel structure from whole-genome data. Here, we present a network-based approach for constructing population structure representations from genetic data. The use of community-detection algorithms from network theory generates a natural hierarchical perspective on the representation that the method produces. The method is computationally efficient, and it requires relatively few assumptions regarding the biological processes that underlie the data. We show the approach by analyzing population structure in the model plant species Arabidopsis thaliana and in human populations. These examples illustrate how network-based approaches for population structure analysis are well-suited to extracting valuable ecological and evolutionary information in the era of large genomic data sets.

Figures

Figure 1.
Schematic representation of a network-based construction of a population structure tree (PST) from genomic data. (A) For each SNP, an inter-individual genetic-similarity network (adjacency matrix) is constructed using a frequency-weighted allele-sharing genetic-similarity measure (Equation 1). To produce a genome-wide genetic-similarity matrix, the mean over all loci is taken. (B) Weak edges are pruned from the matrix by setting low matrix entries to 0 until a community structure emerges, as detected using network community-detection algorithms. Each community (numbered submatrices) is then analyzed independently in a similar manner. Notice that finer-scale clusters are characterized by darker matrices, indicating structures with higher genetic similarities. (C) The analysis is summarized as a PST diagram showing the hierarchical levels of population structure and their relationships.
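The pruning-and-recursion idea in panels B and C can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a precomputed symmetric similarity matrix (the similarity measure of Equation 1 is not reproduced here), and it swaps in connected components as a simple stand-in for the community-detection step. The names `components`, `split_by_pruning`, `build_pst`, and the `min_size` stopping rule are all hypothetical.

```python
from itertools import combinations


def components(n, edges):
    """Connected components of an undirected graph on nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(sorted(comp))
    return comps


def split_by_pruning(sim, members):
    """Raise the pruning threshold until `members` fall apart into >1 group.

    Stand-in assumption: a "community" is a connected component of the
    pruned network. Returns a list of groups, or [members] if no split.
    """
    pairs = list(combinations(range(len(members)), 2))
    weights = sorted({sim[members[i]][members[j]] for i, j in pairs})
    for t in weights:
        # Keep only edges strictly stronger than the current threshold.
        kept = [(i, j) for i, j in pairs if sim[members[i]][members[j]] > t]
        comps = components(len(members), kept)
        if len(comps) > 1:
            return [[members[i] for i in c] for c in comps]
    return [members]


def build_pst(sim, members=None, min_size=4):
    """Recursively split clusters into a nested dict tree.

    `min_size` is a hypothetical stopping rule: smaller clusters are leaves.
    """
    if members is None:
        members = list(range(len(sim)))
    node = {"members": members, "children": []}
    if len(members) >= min_size:
        groups = split_by_pruning(sim, members)
        if len(groups) > 1:
            node["children"] = [build_pst(sim, g, min_size) for g in groups]
    return node


# Toy similarity matrix (made-up values): two blocks of three individuals.
toy_sim = [
    [1.0, 0.95, 0.9, 0.2, 0.2, 0.2],
    [0.95, 1.0, 0.9, 0.2, 0.2, 0.2],
    [0.9, 0.9, 1.0, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 1.0, 0.8, 0.8],
    [0.2, 0.2, 0.2, 0.8, 1.0, 0.8],
    [0.2, 0.2, 0.2, 0.8, 0.8, 1.0],
]
toy_tree = build_pst(toy_sim)
```

On the toy matrix, pruning the weakest edges separates the two blocks at the root, and each block then becomes a leaf because it is smaller than `min_size`. A real analysis would replace the connected-components step with a proper community-detection algorithm.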
Figure 2.
Broad-scale population structure of A. thaliana. (A) The inferred population structure tree (PST). Each element in the hierarchy represents a cluster of individuals, and each cluster contains those clusters below it in the hierarchy. The root element represents the entire sample of 1214 individuals. In colored dashed lines, the main regions corresponding to sampling locations are indicated (with the labels defined post hoc). (B) Visualization of the branch corresponding to most European sampling locations. Each cluster is assigned a color such that “closer” colors represent closer clusters in the PST. On the map, each individual is placed at its sampling location and colored according to the finest-scale cluster to which it was assigned. (C) Visualization of the branch corresponding to Africa, Asia, North Sweden, and some samples in the Iberian Peninsula.
Figure 3.
Fine-scale population structure of A. thaliana in the Iberian Peninsula and Morocco. On each map, a branch of the inferred PST from Figure 2 is visualized. Adjacent to each map is the PST colored in the same manner as in the map. A–E show subbranches of the primarily European branch (blue branch in Fig. 2A), and F and G show subbranches of the primarily non-European branch (orange branch in Fig. 2A). (A) Branch corresponding to most of the Iberian population (Iberian Peninsula branch in Fig. 2B). (B) Branch corresponding to the western part of the Iberian Peninsula, with differentiation along a north–south gradient. (C) Branch corresponding to the northeastern part of the peninsula, with differentiation along an east–west gradient. (D) Branch corresponding to a small area in the center of the peninsula. (E) Branch corresponding to north Spain, with a differentiated subbranch along the northeast coast. (F) Branch not belonging to the primary European branch in Figure 2A, corresponding to the Iberian Peninsula and Morocco. Population structure extends on a north–south axis from Africa to Europe over the Strait of Gibraltar. (G) Branch corresponding to Morocco, showing fine-scale population structure in Morocco.
Figure 4.
Population structure tree in humans. Closer colors represent closer clusters on the PST. (A) Main branches corresponding to continental groups have been labeled based on assignment of individuals in the branches to population groups (labels defined post hoc). Maya and Pima individuals are found in two subgroups, one within the Americas branch and one outside it. (B) Fine-scale structure revealed by the PST. Clusters or cluster groups in which a majority (>50%) of individuals have a particular label or labels are circled and marked with the corresponding group labels. Under each label, detailed assignments are given in the format x/y(z%): (x) number of individuals with the marked label assigned to the cluster group; (y) number of individuals in the data set with that label; (z) proportion of individuals with the marked labels among all individuals assigned to the cluster group. In the case of cluster groups with more than one hierarchical level, the detailed assignments refer to the cluster at the highest hierarchical level in the cluster group. For non-leaf cluster groups (marked with *), the proportion z is taken among all individuals in the cluster group, omitting all individuals in all descendant clusters assigned a label different from the label of the cluster group. Labels in bold indicate clusters or cluster groups (not considering omitted individuals, if relevant) that contain only individuals that have the marked labels (i.e., z = 100%).
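As a reading aid for the x/y(z%) notation, the following Python sketch computes the three quantities for the simple leaf-cluster case. It does not implement the special rule for non-leaf cluster groups (omitting individuals in descendant clusters that carry a different label). The function names and the labels "A" and "B" are hypothetical, and the counts are made up for illustration.

```python
def assignment_summary(cluster_labels, all_labels, marked):
    """Return (x, y, z) for one cluster, following the caption's definitions.

    x: individuals with a marked label assigned to the cluster
    y: individuals in the whole data set with a marked label
    z: percentage of the cluster's individuals that carry a marked label
    """
    if not cluster_labels:
        raise ValueError("cluster must contain at least one individual")
    marked = set(marked)
    x = sum(1 for lab in cluster_labels if lab in marked)
    y = sum(1 for lab in all_labels if lab in marked)
    z = 100.0 * x / len(cluster_labels)
    return x, y, z


def format_summary(x, y, z):
    """Render the triple in the x/y(z%) style used in the figure."""
    return f"{x}/{y}({z:.0f}%)"


# Made-up example: a data set with 10 "A" and 30 "B" individuals,
# and a cluster holding 8 of the "A" and 2 of the "B" individuals.
all_labels = ["A"] * 10 + ["B"] * 30
cluster = ["A"] * 8 + ["B"] * 2
summary = format_summary(*assignment_summary(cluster, all_labels, {"A"}))
```

Here the cluster would be marked "A" because more than 50% of its members carry that label, and it would not be shown in bold because z is below 100%.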
Figure 5.
Fine-scale human population structure. Shown is a visualization of the PST on a world map. Each open circle corresponds to one of the 52 groups of the HGDP data set; group coordinates from Rosenberg (2011) were used, but adjusted such that circles do not overlap (for group labels, see Supplemental Fig. S3). Individuals were positioned randomly within their corresponding circles and colored according to the finest-scale cluster to which they were assigned. To illustrate fine-scale structure at a local level, a variety of regions are shown in detail. (A) World map, with colors corresponding to the coloring of the entire PST, as in Figure 4. (B) Branch corresponding to Europe. (C) Branch corresponding to the Mediterranean region. (D) Branch corresponding to northern China. (E) Branch corresponding to Japan and central and southern China. (F) Branch corresponding to central and southern China. (G) Branch corresponding to Balochi, Brahui, and Makrani groups. (H) Branch corresponding to Burusho, Kalash, Pathan, and Sindhi groups. (I) Branch corresponding to sub-Saharan Africa. In each inset, the branch has been re-colored according to an automatic coloring scheme, which assigns closer colors to clusters positioned closer in the PST, except in B and G, where each cluster was assigned a color manually, irrespective of positioning in the PST.
Figure 6.
Normalized mutual information (NMI) between the PST inferred using the entire genome and PSTs inferred from subsampled fractions of the genome. The NMI values evaluate the amount of information gained on hierarchical population structure by sampling a fraction of the data set. The mean NMI across 100 random subsamples for each SNP coverage value considering the entire PST topology is shown in purple; mean NMI considering only the finest-scale leaf clusters is shown in orange. Shaded regions show standard deviations across 100 sampling replicates. (A) A. thaliana data set. Inset shows NMI for subsamples below 200,000 SNPs. (B) Human data set. NMI values saturate at values below 1 because cluster assignments often switch at fine scales (e.g., between PST leaves) when PSTs are inferred from subsampled data.
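For the leaf-cluster comparison, NMI between two flat cluster assignments can be sketched in Python as follows. This is an illustration only. It normalizes by the arithmetic mean of the two entropies, which is one common convention; the caption does not state which normalization was used, and the NMI over the entire PST topology is a hierarchical measure that this sketch does not cover. The function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a cluster assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """NMI of two cluster assignments over the same individuals.

    a[i] and b[i] are the cluster labels of individual i in each partition.
    Normalization (an assumption): 2 * I(a; b) / (H(a) + H(b)).
    """
    if len(a) != len(b) or not a:
        raise ValueError("assignments must be non-empty and of equal length")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mutual_info = sum(
        (n_ab / n) * log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both partitions are a single cluster, hence identical
    return 2 * mutual_info / (h_a + h_b)
```

Two partitions that differ only in how clusters are named give an NMI of 1, while statistically independent partitions give 0. Values in between reflect the kind of partial agreement described for subsampled genomes.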
