Genome Res. 2022 Aug 25;32(8):1542-1552.
doi: 10.1101/gr.276813.122.

Haplotype and population structure inference using neural networks in whole-genome sequencing data

Jonas Meisner et al. Genome Res. 2022.

Abstract

Accurate inference of population structure is important in many studies of population genetics. Here we present HaploNet, a method for performing dimensionality reduction and clustering of genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or dense genotype data. By using Gaussian mixtures in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We show that we can use haplotype clusters in the latent space to infer global population structure using haplotype information by exploiting the generative properties of our framework. Based on fitted neural networks and their latent haplotype clusters, we can perform principal component analysis and estimate ancestry proportions based on a maximum likelihood framework. Using sequencing data from simulations and closely related human populations, we show that our approach is better at distinguishing closely related populations than standard admixture and principal component analysis software. We further show that HaploNet is fast and highly scalable by applying it to genotype array data of the UK Biobank.
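The maximum likelihood estimation of ancestry proportions mentioned above can be illustrated with a small expectation-maximization (EM) loop. This is a minimal sketch under simplifying assumptions, not HaploNet's implementation: it treats a single individual, assumes the per-window likelihoods under each ancestry component are already computed and held fixed, and estimates only the mixing proportions. The function name `estimate_proportions` and the toy numbers are hypothetical.

```python
def estimate_proportions(lik, n_iter=100, tol=1e-9):
    """EM for mixing proportions q, given fixed per-window likelihoods.

    lik[w][k] is an assumed, precomputed likelihood of the data in
    genomic window w under ancestry component k (hypothetical input).
    """
    n_windows = len(lik)
    k = len(lik[0])
    q = [1.0 / k] * k  # start from uniform proportions
    for _ in range(n_iter):
        new_q = [0.0] * k
        for row in lik:
            # E-step: posterior weight of each component for this window
            denom = sum(q[j] * row[j] for j in range(k))
            for j in range(k):
                new_q[j] += q[j] * row[j] / denom
        # M-step: average the posterior weights over windows
        new_q = [v / n_windows for v in new_q]
        converged = max(abs(a - b) for a, b in zip(q, new_q)) < tol
        q = new_q
        if converged:
            break
    return q


# Toy example: three windows favor component 0, one favors component 1.
toy_lik = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
q_hat = estimate_proportions(toy_lik)
```

In a full model the component-specific frequencies would be estimated jointly with the proportions; holding them fixed here keeps the update to a few lines.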

Figures

Figure 1.
Inference of population structure in different simulation configurations. (A) Overview of simulation configuration of four splits into five populations with equal population sizes at all times. The times of the population splits are designated t1, t2, t3, and t4, measured in generations. (B) Estimated ancestry proportions in one of the four simulation scenarios (Simulation 2) with t1 = 200, t2 = 120, t3 = 80, and t4 = 40 using HaploNet (top) and ADMIXTURE (bottom).
Figure 2.
Estimated ancestry proportions in the full 1000 Genomes Project using HaploNet for K = 15. ADMIXTURE was not able to converge to a solution in 100 runs for this scenario.
Figure 3.
Estimated ancestry proportions in the superpopulations of the 1000 Genomes Project using HaploNet (left column) and ADMIXTURE (right column) for African (AFR), American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS), respectively.
Figure 4.
Estimated ancestry proportions in the subset of unrelated, self-identified “white British” individuals in the UK Biobank using HaploNet for K = 3 and K = 4, respectively. Individuals are plotted by their birthplace coordinates and colored by their highest associated ancestry component.
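The coloring rule in this caption amounts to taking, for each individual, the index of the largest estimated ancestry proportion. A minimal sketch, assuming the proportions are available as one row per individual; the matrix `Q` and the helper name `dominant_component` are hypothetical and the values are made up for illustration:

```python
def dominant_component(q_row):
    """Return the index of the largest ancestry proportion in one row."""
    return max(range(len(q_row)), key=lambda k: q_row[k])


# Hypothetical ancestry proportions for three individuals with K = 3.
Q = [
    [0.70, 0.20, 0.10],
    [0.15, 0.25, 0.60],
    [0.30, 0.45, 0.25],
]
labels = [dominant_component(row) for row in Q]
```

Each label can then be mapped to a plotting color; ties resolve to the lowest index in this sketch.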
Figure 5.
The neural network (NN) architecture of HaploNet, split into three major substructures. Solid lines represent the estimation of distribution parameters, and dashed lines represent the sampling of latent variables. (A) The NN parameterizing the distribution qϕ(y|x), for sampling the haplotype cluster; (B) the network parameterizing the regularizing distribution of the sampled encoding, pθ(z|y); and (C) the network parameterizing the distribution qϕ(z|x,y), for sampling the haplotype encoding, as well as the network decoding the sampled encoding to reconstruct the input. The colors of the network blocks are consistent across substructures, such that the sampled y in A is used in both B and C.
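The three distributions named in this caption suggest the usual evidence lower bound for a Gaussian mixture variational autoencoder. The following is a sketch under two assumptions not stated in the caption: the decoder has the form pθ(x|z), and the cluster variable has a prior p(y). With the generative factorization p(x, y, z) = pθ(x|z) pθ(z|y) p(y) and the approximate posterior qϕ(y, z|x) = qϕ(y|x) qϕ(z|x,y), the bound is:

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(y \mid x)}\Big[
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\big)
\Big]
- \mathrm{KL}\big(q_\phi(y \mid x) \,\|\, p(y)\big)
```

The first term rewards reconstruction of the input, the inner KL term ties the sampled encoding to the cluster-specific distribution in B, and the outer KL term regularizes the cluster assignment in A.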
