Using neural networks to mine text and predict metabolic traits for thousands of microbes

PLoS Comput Biol. 2021 Mar 2;17(3):e1008757. doi: 10.1371/journal.pcbi.1008757. eCollection 2021 Mar.


Microbes can metabolize more chemical compounds than any other group of organisms. As a result, their metabolism is of interest to investigators across biology. Despite this interest, information on the metabolism of specific microbes is hard to access: it is buried in the text of books and journals, and investigators have no easy way to extract it. Here we investigate whether neural networks can extract this information and predict metabolic traits. As a proof of concept, we predicted two traits: whether microbes carry out one type of metabolism (fermentation) or produce one metabolite (acetate). We collected written descriptions of 7,021 species of bacteria and archaea from Bergey's Manual. We read the descriptions and manually identified (labeled) which species were fermentative or produced acetate. We then trained neural networks to predict these labels. In total, we identified 2,364 species as fermentative, and 1,009 species as also producing acetate. Neural networks could predict which species were fermentative with 97.3% accuracy. Accuracy was even higher (98.6%) when predicting species also producing acetate. Phylogenetic trees of species and their traits confirmed that predictions were accurate. Our approach with neural networks can extract information efficiently and accurately. It paves the way for putting more metabolic traits into databases, giving investigators easy access to this information.
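
The workflow described above (label written descriptions, train a network on the text, then measure accuracy) can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not specify an architecture, so this example assumes the simplest possible network, a single sigmoid unit over bag-of-words features, written with the Python standard library only. The four descriptions and their labels are invented for illustration and are not taken from Bergey's Manual; all function and variable names are hypothetical.

```python
import math
import re

# Invented toy descriptions (NOT from Bergey's Manual).
# Label 1 = fermentative, 0 = not fermentative.
TRAIN = [
    ("Ferments glucose to lactate and acetate under anaerobic conditions", 1),
    ("Fermentative metabolism; sugars are fermented to acids", 1),
    ("Strictly aerobic with respiratory metabolism; oxidase positive", 0),
    ("Obligate aerobe; respiratory metabolism with oxygen as electron acceptor", 0),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(texts):
    """Map each distinct word to a feature index."""
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            vocab.setdefault(word, len(vocab))
    return vocab


def featurize(text, vocab):
    """Binary bag-of-words vector: 1.0 if the word occurs in the text."""
    x = [0.0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            x[vocab[word]] = 1.0
    return x


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, vocab, epochs=50, lr=0.5):
    """Fit one sigmoid unit with per-sample gradient descent on log loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    data = [(featurize(text, vocab), label) for text, label in samples]
    for _ in range(epochs):
        for x, label in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            grad = p - label  # derivative of log loss w.r.t. the pre-activation
            for i, xi in enumerate(x):
                weights[i] -= lr * grad * xi
            bias -= lr * grad
    return weights, bias


def predict(text, vocab, weights, bias):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    x = featurize(text, vocab)
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return int(sigmoid(z) >= 0.5)


def accuracy(samples, vocab, weights, bias):
    """Fraction of samples whose predicted label matches the given label."""
    hits = sum(predict(t, vocab, weights, bias) == y for t, y in samples)
    return hits / len(samples)


vocab = build_vocab([text for text, _ in TRAIN])
weights, bias = train(TRAIN, vocab)
print("training accuracy:", accuracy(TRAIN, vocab, weights, bias))
```

Accuracy here is measured on the same four sentences used for training, so it says nothing about generalization. A realistic evaluation would hold out labeled descriptions that the network never saw during training.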

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Acetates / metabolism
  • Archaea* / classification
  • Archaea* / metabolism
  • Bacteria* / classification
  • Bacteria* / metabolism
  • Computational Biology
  • Data Mining / methods*
  • Databases, Factual
  • Fermentation / physiology
  • Neural Networks, Computer*
  • Phylogeny


Substances

  • Acetates

Grants and funding

This work was supported by Hatch Project Accession 1019985 (TJH) and 1024983 (TJH) from the United States Department of Agriculture National Institute of Food and Agriculture. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.