The rich information encoded in cis-regulatory DNA sequences has not been fully exploited for gene function prediction in reverse genetics. Here we show that orthologous cis-regulatory sequences that diverged approximately 160 million years ago share little sequence similarity, yet remarkably retain semantic similarity that can be effectively captured by a deep learning model, PhytoBabel. Although trained solely on orthologous cis-regulatory sequence pairs from 15 angiosperms, PhytoBabel implicitly learned spatio-temporal gene expression patterns, conserved noncoding sequences, semantically similar fragments and phylogenetic relationships among species. Furthermore, PhytoBabel enables the discovery of evolutionarily unrelated but semantically similar cis-regulatory sequences, facilitating the identification of novel genes with functions of interest. As a proof of concept, we identified somatic embryogenesis-related morphogenic regulators in maize that exhibit semantic similarity to known Arabidopsis morphogenic regulators. By bridging the gap in the cis-regulatory sequence → semantics → gene function information chain, PhytoBabel provides a valuable tool for gene function prediction in reverse genetics.
© 2026. The Author(s), under exclusive licence to Springer Nature Limited.