, 20 (Suppl 9), 337

ImaGene: A Convolutional Neural Network to Quantify Natural Selection From Genomic Data

Luis Torada et al. BMC Bioinformatics.

Abstract

Background: The genetic bases of many complex phenotypes are still largely unknown, mostly because the traits are polygenic and each associated mutation has a small effect. An evolutionary framework offers an alternative to classic association studies for determining such genetic bases. As sites targeted by natural selection are likely to harbor important functionalities for the carrier, identifying selection signatures in the genome has the potential to unveil the genetic mechanisms underpinning human phenotypes. Popular methods for detecting such signals compress genomic information into summary statistics, which loses information. Furthermore, few methods are able to quantify the strength of selection. Here we explored the use of deep learning in evolutionary biology and implemented a program, called ImaGene, to apply convolutional neural networks to population genomic data for the detection and quantification of natural selection.

Results: ImaGene enables genomic information from multiple individuals to be represented as abstract images. Each image is created by stacking aligned genomic data and encoding distinct alleles into separate colors. To detect and quantify signatures of positive selection, ImaGene implements a convolutional neural network which is trained using simulations. We show how the method implemented in ImaGene can be affected by data manipulation and learning strategies. In particular, we show how sorting images by row and column leads to accurate predictions. We also demonstrate how misspecifying the demographic model used to produce training data can influence the quantification of positive selection. We finally illustrate an approach to estimate the selection coefficient, a continuous variable, using multiclass classification techniques.
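The row and column sorting mentioned above can be illustrated with a minimal sketch. It assumes a binary image in which each row is a haplotype and each column a polymorphic site, and it interprets "sorting" as ordering identical rows (or columns) by how often they occur. The function names and the tie-breaking rule are hypothetical and chosen for illustration; ImaGene's own implementation may differ.

```python
from collections import Counter


def sort_rows_by_frequency(image):
    """Order rows so that the most frequently occurring rows come first."""
    rows = [tuple(row) for row in image]
    counts = Counter(rows)
    # Descending count; ties are broken by row content (an assumption made
    # here only to keep the result deterministic).
    ordered = sorted(rows, key=lambda r: (-counts[r], r))
    return [list(row) for row in ordered]


def sort_columns_by_frequency(image):
    """Apply the same ordering to columns by transposing, sorting, transposing back."""
    transposed = [list(col) for col in zip(*image)]
    sorted_cols = sort_rows_by_frequency(transposed)
    return [list(row) for row in zip(*sorted_cols)]


# Toy image: 4 haplotypes x 3 sites, 1 = derived allele, 0 = ancestral allele.
image = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]

rows_sorted = sort_rows_by_frequency(image)
both_sorted = sort_columns_by_frequency(rows_sorted)

for row in both_sorted:
    print(row)
```

Sorting removes the arbitrary ordering of sampled haplotypes, so two images carrying the same genetic information look the same to the network.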

Conclusions: While the use of deep learning in evolutionary genomics is in its infancy, here we demonstrated its potential to detect informative patterns from large-scale genomic data. We implemented methods to process genomic data for deep learning in a user-friendly program called ImaGene. The joint inference of the evolutionary history of mutations and their functional impact will facilitate mapping studies and provide novel insights into the molecular mechanisms associated with human phenotypes.

Keywords: Convolutional neural networks; Natural selection; Population genetics; Supervised machine learning.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Image representations of human population genomic data for the EDAR gene. In panels a and b, each row represents a population from the 1000 Genomes Project data set, sorted from top to bottom by increasing geographical distance from central Africa. Each pixel encodes the frequency of the four nucleotides (panel a) or of the derived allele (panel b) for each polymorphism. Panels c-e refer to the Han Chinese population only, and each row represents a sampled haplotype. Each pixel encodes the frequency of the four nucleotides (c), the derived allele (d), or the minor allele calculated across all populations (e)
Fig. 2
Image representations with different sorting conditions. The same image of genomic data is presented before (a) and after its rows (b), columns (c), or both (d) have been sorted by frequency of occurrence
Fig. 3
Accuracy of detecting positive selection using images with different sorting conditions. For each tested strength of positive selection (S={200,300,400}) we report the confusion matrices for predicting whether a genomic region is under neutrality (N) or selection (S) when images have been sorted with different conditions
Fig. 4
Accuracy of quantifying positive selection under different training models. We report the confusion matrices for predicting whether a genomic region is under neutrality (S=0), weak-to-moderate selection (S=200), or strong selection (S=400) when the network has been trained under the correct demographic model (3-epoch, on the left) or the incorrect one (1-epoch, on the right)
Fig. 5
Accuracy of quantifying positive selection under different representations of the distribution of true labels. Confusion matrices for estimating selection coefficients in 11 intervals from 0 to 400. Classification was performed assuming a different representation of true labels: a categorical distribution, a Gaussian distribution, or a perturbed categorical distribution
Fig. 6
Sampled posterior distributions of selection coefficients. Histograms of 100,000 random samples from the posterior distributions of one case of weak-to-moderate selection (S=120, on the left) and one case of strong selection (S=320, on the right). Point estimates and credible intervals are reported
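
The label representations named in the Fig. 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes 11 evenly spaced class values from 0 to 400, and the Gaussian width (sigma=40) and all names are hypothetical. The perturbed categorical variant is not shown because the text here does not specify how the perturbation is applied.

```python
import math

N_CLASSES = 11
S_MAX = 400.0
# Assumed class values: 0, 40, 80, ..., 400.
CENTERS = [i * S_MAX / (N_CLASSES - 1) for i in range(N_CLASSES)]


def categorical_label(s):
    """One-hot label: all probability mass on the class nearest to s."""
    nearest = min(range(N_CLASSES), key=lambda i: abs(CENTERS[i] - s))
    return [1.0 if i == nearest else 0.0 for i in range(N_CLASSES)]


def gaussian_label(s, sigma=40.0):
    """Soft label: Gaussian weights centered on s, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((c - s) / sigma) ** 2) for c in CENTERS]
    total = sum(weights)
    return [w / total for w in weights]


hard = categorical_label(120)
soft = gaussian_label(120)

print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

A soft label tells the classifier that predicting a neighboring class is a smaller error than predicting a distant one, which a one-hot label cannot express.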
