Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Nov 22:10:1091.
doi: 10.3389/fgene.2019.01091. eCollection 2019.

Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean

Affiliations

Phenotype Prediction and Genome-Wide Association Study Using Deep Convolutional Neural Network of Soybean

Yang Liu et al. Front Genet. .

Abstract

Genomic selection uses single-nucleotide polymorphisms (SNPs) to predict quantitative phenotypes for enhancing traits in breeding populations and has been widely used to increase breeding efficiency for plants and animals. Existing statistical methods rely on a prior distribution assumption of imputed genotype effects, which may not fit experimental datasets. Emerging deep learning technology could serve as a powerful machine learning tool to predict quantitative phenotypes without imputation and also to discover potential associated genotype markers efficiently. We propose a deep-learning framework using convolutional neural networks (CNNs) to predict the quantitative traits from SNPs and also to investigate genotype contributions to the trait using saliency maps. The missing values of SNPs are treated as a new genotype for the input of the deep learning model. We tested our framework on both simulation data and experimental datasets of soybean. The results show that the deep learning model can bypass the imputation of missing values and achieve more accurate results for predicting quantitative phenotypes than currently available other well-known statistical methods. It can also effectively and efficiently identify significant markers of SNPs and SNP combinations associated in genome-wide association study.

Keywords: deep learning; genome-wide association study; genomic selection; genotype contribution; soybean.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Dual-stream CNN model structure. Genotypes are one-hot coded and passed to the input processing block, which contains two streams of CNNs. The first stacked-CNN stream contains two feed-forward CNN layers with kernel sizes 4 and 20. The second single-CNN stream contains one CNN layer with kernel size 4, followed by an add-up layer to aggregate outputs from the two streams. The feature processing block contains another single convolution layer with kernel size 4. Processed features are then passed to the output processing block, which contains a flatten layer and a fully connected dense layer. CNN, convolutional neural network.
Figure 2
Figure 2
Training loss different deep learning models. The x-axis is number of epochs; the y-axis is the training the loss of mean absolute error (MAE) of validation dataset. The singleCNN (purple), dualCNN (blue), and Dense (green) network are conserved, and DeepGS is overfitting after 20 epochs, and our dualCNN has the lowest training loss. CNN, convolutional neural network.
Figure 3
Figure 3
Average Pearson correlation coefficient of five traits using different sizes of training dataset. The x-axis is number of folds of training data; the y-axis is the average Pearson correlation coefficient from cross-validation.
Figure 4
Figure 4
Comparison of genotype contribution using saliency map and GWAS Wald test of simulation (A) and experimental soybean dataset with five traits (BF). The x-axis is the index of SNPs in the genotype matrix; the y-axis is the saliency and Wald test results. Top ranked SNPs were plotted in red. GWAS, genome-wide association study; SNP, single-nucleotide polymorphism.

Similar articles

Cited by

References

    1. Akond A. M., Ragin B., Bazzelle R., Kantartzi S. K., Meksem K., Kassem M. A. (2012). Quantitative trait loci associated with moisture, protein, and oil content in soybean [Glycine max (L.) Merr.]. J. Agric. Sci. 4 (11), 16. 10.5539/jas.v4n11p16 - DOI
    1. Alipanahi B., Delong A., Weirauch M. T., Frey B. J. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat. Biotechnol. 33 (8), 831. 10.1038/nbt.3300 - DOI - PubMed
    1. Angermueller C., Pärnamaa T., Parts L., Stegle O. (2016). Deep learning for computational biology. Mol. Systems Biol. 1 2 (7), 878. 10.15252/msb.20156651 - DOI - PMC - PubMed
    1. Ashburner M., Ball C. A., Blake J. A., Botstein D., Butler H., Cherry J. M., et al. (2000). Gene Ontology: tool for the unification of biology. Nat. Genet. 25 (1), 25. 10.1038/75556 - DOI - PMC - PubMed
    1. Bateman A., Coin L., Durbin R., Finn R. D., Hollich V., Griffiths‐Jones S., et al. (2004). The Pfam protein families database. Nucleic Acids Res. 32 (suppl_1), D138–D141. 10.1093/nar/gkh121 - DOI - PMC - PubMed

LinkOut - more resources