Mol Ecol Resour. 2021 Nov;21(8):2689-2705.
doi: 10.1111/1755-0998.13386. Epub 2021 May 3.

Automatic inference of demographic parameters using generative adversarial networks

Zhanpeng Wang et al. Mol Ecol Resour. 2021 Nov.

Abstract

Population genetics relies heavily on simulated data for validation, inference and intuition. In particular, since the evolutionary 'ground truth' for real data is always limited, simulated data are crucial for training supervised machine learning methods. Simulation software can accurately model evolutionary processes but requires many hand-selected input parameters. As a result, simulated data often fail to mirror the properties of real genetic data, which limits the scope of methods that rely on them. Here, we develop a novel approach to estimating parameters in population genetic models that automatically adapts to data from any population. Our method, pg-gan, is based on a generative adversarial network that gradually learns to generate realistic synthetic data. We demonstrate that our method is able to recover input parameters in a simulated isolation-with-migration model. We then apply our method to human data from the 1000 Genomes Project and show that we can accurately recapitulate the features of real data.

Keywords: demographic inference; evolutionary modelling; generative adversarial network; simulated data.

Figures

FIGURE 1
pg‐gan algorithm overview. The inputs to our method are an evolutionary model and a set of real data (orange). The parameters of the generator and discriminator (green) are updated in a unified training framework using simulated annealing (generator) and backpropagation (discriminator). The generated data and real data are analysed one genotype matrix at a time, where n is the number of haplotypes and S is the number of SNPs retained in each region. Inter‐SNP distances are also fed in as a second channel, which provides the discriminator with information about SNP density
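The two-channel input described in the Figure 1 caption can be sketched in plain Python as follows. This is only an illustration, not the pg-gan implementation: the function name `build_region`, the choice to give the first SNP a distance of 0, and the lack of any rescaling of distances are assumptions.

```python
def build_region(haplotypes, positions):
    """Stack genotypes and inter-SNP distances into an (n, S, 2) nested list.

    haplotypes: n rows, each with S entries (0 = ancestral, 1 = derived).
    positions:  S sorted base-pair coordinates, one per SNP.
    """
    num_snps = len(positions)
    if any(len(row) != num_snps for row in haplotypes):
        raise ValueError("each haplotype needs one allele per SNP")
    # Distance from each SNP to the previous one; the first SNP gets 0 (assumption).
    distances = [0] + [positions[j] - positions[j - 1] for j in range(1, num_snps)]
    # Channel 0 holds the allele; channel 1 repeats the distance on every haplotype row.
    return [[[allele, distances[j]] for j, allele in enumerate(row)]
            for row in haplotypes]


region = build_region([[0, 1, 1], [1, 0, 1]], [100, 250, 900])
```

Because the distance channel is identical on every row, a discriminator reading this tensor can recover SNP density without seeing absolute coordinates.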
FIGURE 2
Multi‐population CNN discriminator architecture. Each example region is of shape (n,S,2), where n is the number of haplotypes (usually with n/2 from population 1 and n/2 from population 2). The convolutional filters for population 1 and 2 are shared (i.e. not separate weights) so that haplotype commonalities can be more easily identified. The final output of the discriminator is the probability the region is real (which can be subtracted from 1 to find the probability the region is simulated). This CNN can be reduced for one population or extended for three populations
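The weight sharing mentioned in the Figure 2 caption can be illustrated with a toy one-dimensional filter. The sketch below is a simplification under stated assumptions: it uses a single channel, a single hand-picked kernel and no learning, and the names `conv1d` and `shared_filter` are hypothetical rather than taken from the authors' code.

```python
def conv1d(row, kernel):
    """Valid 1D cross-correlation of one haplotype row with a kernel."""
    width = len(kernel)
    return [
        sum(row[start + k] * kernel[k] for k in range(width))
        for start in range(len(row) - width + 1)
    ]


def shared_filter(region, kernel):
    """Apply one kernel to both population halves of an n-row region."""
    half = len(region) // 2
    pop1, pop2 = region[:half], region[half:]
    # The same kernel is used for both populations: shared, not separate, weights.
    return ([conv1d(row, kernel) for row in pop1],
            [conv1d(row, kernel) for row in pop2])


region = [[0, 1, 1, 0], [1, 1, 0, 0],   # population 1
          [1, 1, 0, 0], [0, 0, 1, 1]]   # population 2
features1, features2 = shared_filter(region, kernel=[1, -1])
```

With shared weights, a haplotype that appears in both populations yields the same features in each (here `features1[1]` equals `features2[0]`), which is one way to read the caption's remark that commonalities become easier to identify.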
FIGURE 3
Set of models. (a) A six‐parameter, two‐population isolation‐with‐migration model, which we use in the simulation study. The migration event is a single pulse at time Tsplit/2, and can be in either direction. The final parameter (not shown in this diagram) is the recombination rate. (b) A five‐parameter, single‐population exponential growth model, which we use to infer histories for YRI, CEU and CHB separately. (c) A seven‐parameter, two‐population model, which we fit separately for YRI/CHB and YRI/CEU. The migration can be in either direction. (d) A seven‐parameter, two‐population model, which we fit to CEU/CHB. Migration occurs at T2/2 and can be in either direction
FIGURE 4
IM model parameter inference on simulated training data. In this scenario, we jointly infer the six parameters of the IM model from Figure 3a. The top plot shows both loss functions over the course of GAN training, and the second plot shows classification accuracy for both simulated and training data. The remaining plots show the model parameters as they are refined throughout GAN training. The inferred values are taken at the final iteration
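The parameter trajectories in Figure 4 come from the generator update, which the Figure 1 caption describes as simulated annealing. A minimal sketch of that idea for a single parameter is given below. Everything specific here is an assumption made for illustration: the quadratic toy loss stands in for the generator loss, and the linear cooling schedule, Gaussian proposals and the name `anneal` are not taken from the paper.

```python
import math
import random


def anneal(loss, start, bounds, iterations=300, seed=0):
    """Toy simulated annealing over one parameter constrained to [low, high]."""
    rng = random.Random(seed)
    low, high = bounds
    current, current_loss = start, loss(start)
    for step in range(iterations):
        temperature = 1.0 - step / iterations  # linear cooling (assumption)
        # Propose a nearby value; proposals narrow as the temperature drops.
        width = (high - low) * 0.1 * temperature
        proposal = min(high, max(low, rng.gauss(current, width)))
        delta = loss(proposal) - current_loss
        # Always accept improvements; accept worse moves with decaying probability.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = proposal, current_loss + delta
    return current


def toy_loss(x):
    """Stand-in for the generator loss, minimised at x = 3."""
    return (x - 3.0) ** 2


estimate = anneal(toy_loss, start=8.0, bounds=(0.0, 10.0))
```

Early iterations tolerate uphill moves, which lets the parameter wander; late iterations accept almost only improvements, which is consistent with trajectories that fluctuate at first and then settle.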
FIGURE 5
IM model statistics on simulated training data. Summary statistics for data simulated under our inferred parameters (‘simulated data’), compared with data simulated under the true parameters (‘training data’). Subfigures on the left correspond to statistics from the first population, and those on the right correspond to the second population. In the bottom panel, we show Fst between the two populations
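For readers unfamiliar with the Fst statistic shown in the bottom panel, one common heterozygosity-based definition can be sketched as follows. The caption does not say which estimator the authors use, so the formula (H_T - H_S) / H_T, the equal weighting of the two populations and the function names are assumptions.

```python
def allele_frequencies(haplotypes):
    """Derived-allele frequency at each SNP for a list of 0/1 haplotype rows."""
    n = len(haplotypes)
    return [sum(column) / n for column in zip(*haplotypes)]


def fst(pop1, pop2):
    """Fst as (H_T - H_S) / H_T, with heterozygosities summed over SNPs."""
    p1, p2 = allele_frequencies(pop1), allele_frequencies(pop2)
    h_within = h_total = 0.0
    for a, b in zip(p1, p2):
        mean = (a + b) / 2  # pooled frequency, weighting both populations equally
        h_within += (2 * a * (1 - a) + 2 * b * (1 - b)) / 2
        h_total += 2 * mean * (1 - mean)
    return (h_total - h_within) / h_total if h_total > 0 else 0.0


same = fst([[0, 1], [1, 1]], [[0, 1], [1, 1]])
fixed = fst([[1, 1], [1, 1]], [[0, 0], [0, 0]])
```

Identical populations give a value of 0, and populations fixed for different alleles give 1, which brackets the range in which real and simulated values are compared.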
FIGURE 6
IM model SFS as inferred by fastsimcoal. Here, we compare the true SFS (‘training data’) with the SFS computed from data simulated under the parameters learned by fastsimcoal (‘simulated data’)
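The site frequency spectrum (SFS) compared in this figure is a histogram of derived-allele counts. The sketch below computes an unfolded SFS for one population; the 0/1 coding with 1 as the derived allele and the function name are assumptions, and this is not fastsimcoal's input format.

```python
def site_frequency_spectrum(haplotypes):
    """Count SNPs by derived-allele count; entry k is the number of SNPs with k copies."""
    n = len(haplotypes)
    spectrum = [0] * (n + 1)
    for column in zip(*haplotypes):
        spectrum[sum(column)] += 1
    return spectrum


sfs = site_frequency_spectrum([[1, 0, 1, 0],
                               [0, 0, 1, 1],
                               [0, 0, 1, 0]])
```

In a two-population setting, the entry at count zero is nonzero when a site segregates only in the other population, as the later two-population captions note.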
FIGURE 7
Single‐population model. Summary statistic comparisons between 1000 Genomes Project data and data simulated under our pg‐gan inferred parameters for a variety of scenarios. Top left: YRI vs. data simulated under the one‐parameter constant population size model. Simulated accuracy: 0.52, overall accuracy: 0.63. Top right: YRI vs. data simulated under the five‐parameter exponential growth model. Simulated accuracy: 0.72, overall accuracy: 0.58. Bottom left: CHB vs. data simulated under the one‐parameter constant population size model. Simulated accuracy: 0.68, overall accuracy: 0.66. Bottom right: CHB vs. data simulated under the five‐parameter exponential growth model. Simulated accuracy: 0.54, overall accuracy: 0.49
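The accuracy figures quoted in this caption can be read with the following sketch, which turns discriminator outputs (the probability that a region is real, as in Figure 2) into accuracies. The 0.5 threshold and the exact definitions of "simulated" and "overall" accuracy used here are assumptions; the probabilities in the example are made up for illustration.

```python
def discriminator_accuracy(real_probs, simulated_probs, threshold=0.5):
    """Return (real, simulated, overall) accuracy from P(real) outputs."""
    real_correct = sum(p >= threshold for p in real_probs)
    simulated_correct = sum(p < threshold for p in simulated_probs)
    total = len(real_probs) + len(simulated_probs)
    return (real_correct / len(real_probs),
            simulated_correct / len(simulated_probs),
            (real_correct + simulated_correct) / total)


real_acc, sim_acc, overall_acc = discriminator_accuracy(
    real_probs=[0.9, 0.6, 0.4, 0.7],
    simulated_probs=[0.2, 0.55, 0.45, 0.8],
)
```

Under these definitions, values close to 0.5 mean the discriminator does little better than guessing, which is the desired outcome when the simulated data resemble the real data.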
FIGURE 8
GAN confusion for 1‐ and 2‐population models. (a) Comparison of one‐ and five‐parameter models. We use a constant population size for the first group of bars, then move to the five‐parameter exponential growth model (Figure 3b). We sample recombination rates from HapMap in both scenarios, instead of fixing the recombination rate. (b) Classification accuracy results on the population split models for YRI/CEU, YRI/CHB and CEU/CHB. The Out‐of‐Africa models and parameter inference for YRI/CEU and YRI/CHB generally seem to do well, but the CEU/CHB split model and/or parameter inference does not result in simulated data that matches real data
FIGURE 9
YRI/CEU: two‐population model. Summary statistic comparison between real 1000 Genomes data and data simulated under the inferred parameters from Table 4 (first row). Left: statistics computed on YRI samples only. Right: statistics computed on CEU samples only. Sites with count zero are segregating in only one population. Fst between the two populations is shown in the bottom panel. Simulated accuracy: 0.68, overall accuracy: 0.54
FIGURE 10
YRI/CEU: two‐population model (fastsimcoal). Summary statistic comparison between YRI/CEU and data simulated under the OOA2 model parameters inferred by fastsimcoal. Here, we include all the statistics (unlike Figure 6) since we are providing fastsimcoal with a recombination rate distribution. Left: statistics computed on YRI samples only. Right: statistics computed on CEU samples only. Sites with count zero are segregating in only one population. Fst between the two populations is shown in the bottom panel
