PGD: a machine learning-based photosynthetic-related gene detection approach

BMC Bioinformatics. 2022 May 17;23(1):183. doi: 10.1186/s12859-022-04722-x.


Background: The primary determinant of crop yield is photosynthetic capacity, which is under the control of photosynthesis-related genes. Identifying the genes involved in photosynthesis is therefore important for photosynthesis research. MapMan Mercator 4 is a powerful annotation tool for assigning genes to functional categories; however, in maize, the functions of approximately 22.15% (9520) of genes remain unclear and are labeled "not assigned". This group may include photosynthesis-related genes that have not yet been identified. The rapidly growing use of machine learning to solve biological problems offers a new opportunity to identify novel photosynthetic genes among the functionally "not assigned" genes in maize.

Results: In this study, we showed that an ensemble learning model using voting eliminates the preferences of single machine learning models. Based on this evaluation, we implemented an ensemble-based machine learning (ML) method using a majority voting scheme and observed that including RNA-seq data from multiple photosynthetic mutants, rather than only a single mutant, could increase prediction accuracy. We call this approach "A Machine Learning-based Photosynthetic-related Gene Detection approach (PGD)". Finally, we predicted 716 photosynthesis-related genes from the "not assigned" category of the maize MapMan annotation. Protein localization prediction (TargetP) and the expression trends of these genes across maize leaf sections indicated that the predictions were reliable and robust. The approach is available online, based on Google Colab.
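
The majority voting scheme described above can be illustrated with a minimal sketch. It is not the PGD implementation: the three rule-based classifiers, their thresholds, and the feature vectors (assumed here to be log2 expression changes of one gene in three photosynthetic mutants) are all hypothetical and chosen only to show how a vote can overrule the preference of a single model.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A classifier maps a feature vector to a label:
# 1 = photosynthesis-related, 0 = not related.
Classifier = Callable[[Sequence[float]], int]


def clf_mean(x: Sequence[float]) -> int:
    """Hypothetical model 1: strong average down-regulation."""
    return 1 if sum(x) / len(x) < -1.0 else 0


def clf_min(x: Sequence[float]) -> int:
    """Hypothetical model 2: very strong down-regulation in any one mutant."""
    return 1 if min(x) < -2.0 else 0


def clf_count(x: Sequence[float]) -> int:
    """Hypothetical model 3: down-regulation in at least two mutants."""
    return 1 if sum(v < -0.5 for v in x) >= 2 else 0


def majority_vote(classifiers: List[Classifier], sample: Sequence[float]) -> int:
    """Return the label predicted by most classifiers (hard voting).

    With an odd number of binary classifiers a tie cannot occur.
    """
    votes = [clf(sample) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label


MODELS = [clf_mean, clf_min, clf_count]

# Hypothetical log2 expression changes of three genes in three mutants.
gene_a = [-2.5, -1.8, -0.2]  # all three models vote 1
gene_b = [0.1, -0.3, 0.4]    # all three models vote 0
gene_c = [-2.2, 0.3, 0.1]    # only clf_min votes 1, so it is outvoted

for name, features in [("gene_a", gene_a), ("gene_b", gene_b), ("gene_c", gene_c)]:
    print(name, majority_vote(MODELS, features))
```

In this toy setup, gene_c is flagged by only one model, so the ensemble labels it 0. In practice the hand-written rules would be replaced by trained models.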

Conclusions: This study reveals a new approach for mining novel genes related to a specific functional category and provides candidate genes for researchers to experimentally define their biological functions.

Keywords: Ensemble learning; Functional category; Machine learning; Photosynthesis; RNA-Seq.

MeSH terms

  • Female
  • Humans
  • Machine Learning
  • Photosynthesis / genetics
  • Plant Leaves / metabolism
  • Pregnancy
  • Preimplantation Diagnosis*
  • Zea mays / genetics