Deep learning-based semantic matching of cis-regulatory DNA sequences facilitates the prediction of gene function

Nat Plants. 2026 Mar;12(3):542-555. doi: 10.1038/s41477-026-02231-w. Epub 2026 Feb 18.

Abstract

The rich information encoded in cis-regulatory DNA sequences has not been fully exploited for gene function prediction in reverse genetics. Here we show that orthologous cis-regulatory sequences that diverged approximately 160 million years ago share little sequence similarity, yet remarkably retain semantic similarity that can be effectively captured by a deep learning model, PhytoBabel. Although trained solely on orthologous cis-regulatory sequence pairs from 15 angiosperms, PhytoBabel implicitly learned spatio-temporal gene expression patterns, conserved noncoding sequences, semantically similar fragments and phylogenetic relationships among species. Furthermore, PhytoBabel enables the discovery of evolutionarily unrelated but semantically similar cis-regulatory sequences, facilitating the identification of novel genes with functions of interest. As a proof of concept, we identified somatic embryogenesis-related morphogenic regulators in maize that exhibit semantic similarity to known Arabidopsis morphogenic regulators. By bridging the gap in the cis-regulatory sequence → semantics → gene function information chain, PhytoBabel provides a valuable tool for gene function prediction in reverse genetics.
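
The contrast drawn above, between low sequence similarity and retained semantic similarity, can be illustrated with a minimal sketch. It is not the PhytoBabel method, which the abstract does not describe in detail. It assumes that sequence-level similarity is approximated by a k-mer Jaccard index and that semantic similarity is the cosine similarity between embedding vectors from some trained model. The function names, the choice of k, the toy sequences and the placeholder embedding values are all hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of a DNA sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(seq_a, seq_b, k=4):
    """Crude alignment-free sequence similarity: Jaccard index of k-mer sets."""
    set_a, set_b = kmers(seq_a, k), kmers(seq_b, k)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy promoter fragments (made up for illustration, not real loci).
    promoter_a = "ACGTACGTGGCATTAGC"
    promoter_b = "TTGACCATGCAAGGTCA"

    # Placeholder embeddings standing in for the output of a trained model;
    # these numbers are invented and are NOT PhytoBabel outputs.
    embedding_a = [0.9, 0.1, 0.4]
    embedding_b = [0.8, 0.2, 0.5]

    print("k-mer Jaccard (sequence level):", kmer_jaccard(promoter_a, promoter_b))
    print("cosine (embedding level):", cosine_similarity(embedding_a, embedding_b))
```

Under these assumptions, a pair of sequences can score near zero on the k-mer measure while their embeddings remain close, which is the kind of divergence between sequence and semantic similarity the abstract reports for orthologous cis-regulatory sequences.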

MeSH terms

  • Arabidopsis / genetics
  • Deep Learning*
  • Gene Expression Regulation, Plant
  • Magnoliopsida* / genetics
  • Regulatory Sequences, Nucleic Acid* / genetics
  • Semantics
  • Zea mays / genetics