Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing

Armen Abnousi; Shira L Broschat; Ananth Kalyanaraman

doi:10.1186/s12859-018-2080-y

BMC Bioinformatics. 2018 Mar 5;19(1):83. doi: 10.1186/s12859-018-2080-y.

Authors

Armen Abnousi¹, Shira L Broschat² ³ ⁴, Ananth Kalyanaraman² ³

Affiliations

¹ School of EECS, Washington State University, 355 NE Spokane St, Pullman, 99164, USA. aabnousi@eecs.wsu.edu.
² School of EECS, Washington State University, 355 NE Spokane St, Pullman, 99164, USA.
³ Paul G. Allen School for Global Animal Health, Washington State University, Pullman, 99164, USA.
⁴ Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, 99164, USA.

Abstract

Background: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. This rapid growth has impeded deployment of existing protein clustering/annotation tools, which depend largely on pairwise sequence alignment.

Results: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.
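To make the sketch-comparison idea concrete, the following is a minimal illustration of min-wise hashing on two short sequences. It is not the coreClust implementation. It assumes that, in a (w,c)-sketch, w is the shingle (substring) length and c is the number of hash functions, which the abstract does not spell out. The function names, the linear hash family, the seed, and the toy sequences are all hypothetical choices made for this example.

```python
import hashlib
import random

# Assumption: a Mersenne prime used as the modulus of a simple linear hash family.
PRIME = (1 << 61) - 1


def shingles(seq, w):
    """Return the set of all length-w substrings (shingles) of seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def stable_hash(text):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(c, seed=0):
    """Draw c (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(c)]


def sketch(seq, w, params):
    """Build a sketch: for each hash function, keep the minimum over all w-shingles."""
    base = [stable_hash(s) % PRIME for s in shingles(seq, w)]
    if not base:
        raise ValueError("sequence is shorter than the shingle length w")
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of matching positions, an estimate of shingle-set Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    params = make_hash_params(c=100)
    region_a = "MKTAYIAKQRQISFVKSHFSRQ"  # toy sequences, not real data
    region_b = "MKTAYIAKQRQISFVKSHFARQ"
    sim = estimate_jaccard(sketch(region_a, 3, params), sketch(region_b, 3, params))
    print(f"estimated similarity: {sim:.2f}")
```

Because each sketch has a fixed length c regardless of sequence length, comparing two regions costs O(c) rather than the cost of a pairwise alignment. This is the general property that makes sketch-based comparison attractive for large data sets.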

Conclusions: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that, when paired with NADDA, our prior work for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.

Keywords: Clustering; Protein conserved region; Protein domain families.

Publication types

Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Amino Acid Sequence
Cluster Analysis
Databases, Protein*
Molecular Sequence Annotation*
Phylogeny
Protein Domains
Rickettsia / classification
Sequence Alignment / methods*

Grants and funding

1262664/National Science Foundation/International