Machine learning-based approach KEVOLVE efficiently identifies SARS-CoV-2 variant-specific genomic signatures

PLoS One. 2024 Jan 19;19(1):e0296627. doi: 10.1371/journal.pone.0296627. eCollection 2024.

Abstract

Machine learning has been shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products. In this paper, we assess KEVOLVE, an approach based on a genetic algorithm with a machine-learning kernel, to identify multiple genomic signatures based on minimal sets of k-mers. In a comparative study, in which we analyzed a large SARS-CoV-2 genome dataset, KEVOLVE was more effective at identifying variant-discriminative signatures than several gold-standard statistical tools. Subsequently, these signatures were characterized using a new extension of KEVOLVE (KANALYZER) to highlight variations of the discriminative signatures among different classes of variants, their genomic location, and the mutations involved. The majority of identified signatures were associated with known mutations among the different variants, whose functional and pathological impact is documented in the available literature. Here we showed that KEVOLVE is a robust machine learning approach to identify discriminative signatures among SARS-CoV-2 variants, which are frequently also biologically relevant, while bypassing multiple sequence alignments. The source code of the method and additional resources are available at: https://github.com/bioinfoUQAM/KEVOLVE.
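To make the idea of an alignment-free, k-mer-based signature concrete, the following minimal Python sketch extracts k-mers from labeled sequences and keeps those found in every sequence of one class and in no sequence of any other class. It is an illustration only and does not reproduce KEVOLVE: there is no genetic algorithm and no machine-learning kernel here, and the function names (`kmers`, `discriminative_kmers`) and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(labeled_seqs, k):
    """Map each class label to the k-mers present in every sequence of
    that class and absent from every sequence of the other classes.

    This is a naive presence/absence filter, used here only as an
    assumption-laden stand-in for a real signature search."""
    per_class = defaultdict(list)
    for label, seq in labeled_seqs:
        per_class[label].append(kmers(seq, k))

    result = {}
    for label, sets in per_class.items():
        shared = set.intersection(*sets)  # k-mers common to the whole class
        others = set()
        for other_label, other_sets in per_class.items():
            if other_label != label:
                others |= set.union(*other_sets)  # anything seen elsewhere
        result[label] = shared - others
    return result


# Toy, made-up sequences: class B differs from class A by one substitution.
toy = [
    ("A", "ACGTACGTTA"),
    ("A", "ACGTACGTTC"),
    ("B", "ACGAACGTTA"),
    ("B", "ACGAACGTTG"),
]
signatures = discriminative_kmers(toy, 4)
```

In this toy input the single T/A substitution at the fourth position yields the class-specific 4-mers, which shows how a k-mer set can flag a mutation site without aligning the sequences first.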

MeSH terms

  • COVID-19* / diagnosis
  • COVID-19* / genetics
  • Genomics
  • Humans
  • Machine Learning
  • Phylogeny
  • SARS-CoV-2* / genetics

Supplementary concepts

  • SARS-CoV-2 variants

Grants and funding

This work is funded by the Canadian Institutes of Health Research. Abdoulaye Banire Diallo is supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). Hugo Soudeyns and Isabelle Boucoiran are supported by an infrastructure grant from Réseau SIDA et MI of Fonds de la recherche du Québec-santé (FRQS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.