Cluster Analysis of Coronavirus Sequences using Computational Sequence Descriptors: With Applications to SARS, MERS and SARS-CoV-2 (CoVID-19)

Marjan Vračko; Subhash C Basak; Tathagata Dey; Ashesh Nandy

doi:10.2174/1573409917666210202092646

Curr Comput Aided Drug Des. 2021;17(7):936-945. doi: 10.2174/1573409917666210202092646.

Authors

Marjan Vračko¹, Subhash C Basak², Tathagata Dey³, Ashesh Nandy³

Affiliations

¹ Theoretical Department, National Institute of Chemistry, Hajdrihova 19, 1000 Ljubljana, Slovenia.
² Department of Chemistry and Biochemistry, University of Minnesota, Duluth, USA.
³ Centre for Interdisciplinary Research and Education, Kolkata, India.

PMID: 33530913
DOI: 10.2174/1573409917666210202092646

Abstract

Introduction: Coronaviruses comprise a group of enveloped, positive-sense single-stranded RNA viruses that infect humans as well as a wide range of animals. The study was performed on a set of 573 sequences belonging to SARS, MERS and SARS-CoV-2 (CoVID-19) viruses. The sequences were represented with alignment-free sequence descriptors and analyzed with different chemometric methods: Euclidean/Mahalanobis distances, principal component analysis and self-organizing maps (Kohonen networks). We report the cluster structures of the data. The sequences cluster well by virus type; however, some of them tend to belong to more than one virus type.

Background: This is a study of 573 genome sequences belonging to SARS, MERS and SARS-CoV-2 (CoVID-19) coronaviruses.

Objectives: The aim was to compare the virus sequences, which originate from different places around the world.

Methods: The study used alignment-free sequence descriptors to represent the sequences and chemometric methods to analyze clusters.
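
As a rough illustration of the general idea, the sketch below turns sequences into fixed-length numerical vectors without aligning them and then compares the vectors by Euclidean distance. The abstract does not say which descriptors the authors used, so the normalized k-mer frequencies here are an assumed stand-in, not the paper's descriptors. The names `kmer_descriptor` and `euclidean` and the toy sequences are hypothetical, and Mahalanobis distance, principal component analysis and self-organizing maps are left out for brevity.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: genomes stored as cDNA, so T rather than U


def kmer_descriptor(seq, k=2):
    """Return the normalized k-mer frequencies of seq as a fixed-length list.

    Assumption: k-mer frequencies stand in for the (unspecified)
    alignment-free descriptors mentioned in the abstract.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences invented for illustration only (not real viral genomes).
toy = {
    "seq_a": "ACGTACGTACGTACGT",
    "seq_b": "ACGTACGTACGAACGT",  # one substitution relative to seq_a
    "seq_c": "GGGGCCCCGGGGCCCC",  # very different composition
}
desc = {name: kmer_descriptor(s) for name, s in toy.items()}
d_ab = euclidean(desc["seq_a"], desc["seq_b"])
d_ac = euclidean(desc["seq_a"], desc["seq_c"])
print(f"d(a, b) = {d_ab:.3f}, d(a, c) = {d_ac:.3f}")
```

Because every sequence maps to a vector of the same length regardless of genome length, the resulting distance matrix can be passed to any standard clustering or projection method.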

Results: The majority of genome sequences cluster by virus type, but some are outliers.

Conclusion: We identify 71 sequences that tend to belong to more than one cluster.

Keywords: Euclidean distance; MERS; Mahalanobis distance; SARS; SARS-CoV-2 (CoVID-19); alignment-free sequence descriptors; clustering; mathematical representation of sequences; principal component analysis.

MeSH terms

Animals
COVID-19*
Cluster Analysis
Humans
SARS-CoV-2*

Grants and funding

P1-0017/Slovenian Research Agency (ARRS)