3gClust: Human Protein Cluster Analysis

IEEE/ACM Trans Comput Biol Bioinform. 2019 Nov-Dec;16(6):1773-1784. doi: 10.1109/TCBB.2018.2840996. Epub 2018 May 30.

Abstract

We present a human protein cluster analysis by combining: 1) n-gram based amino acid frequency features, 2) optimal feature selection, 3) hierarchical clustering, and 4) advanced partitioning techniques. Our method qualitatively and quantitatively groups proteins with increasing sequence similarity into similar clusters by calculating the frequency model of amino acids using n-grams. We experiment with n = 1 (unigrams), n = 2 (bigrams), and n = 3 (trigrams) for optimal selection of features to design the 3gClust algorithm. The benchmarking results on 20,105 manually curated human proteins show that 3gClust ensures better cluster compactness in the case of proteins with similar functional groups, biological processes, structural alignment, and shared domains (e.g., aquaporins, keratins). Quantitative analysis of non-singleton clusters shows significant improvement in their compactness in comparison to other state-of-the-art methodologies. 3gClust is available for academic use, along with datasets, at https://sites.google.com/site/bioinfoju/projects/3gclust. Supplementary materials can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2840996.
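
To make the feature construction concrete, the following is a minimal Python sketch of n-gram amino acid frequency vectors followed by a naive agglomerative (hierarchical) clustering. It is not the 3gClust implementation: the abstract does not specify the distance measure, linkage rule, feature selection, or partitioning step, so the Euclidean distance, average linkage, the `threshold` parameter, and the function names `ngram_frequencies` and `agglomerate` are all assumptions chosen for illustration. Feature selection and the partitioning stage are omitted entirely.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_frequencies(sequence, n):
    """Relative frequency of every overlapping n-gram over the 20-letter alphabet.

    Returns a vector of length 20**n (20 for unigrams, 400 for bigrams,
    8000 for trigrams). Windows containing non-standard residues count
    toward the total but have no slot in the vector (an assumption).
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=n))
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(vectors, threshold):
    """Naive average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (a hypothetical cut-off).
    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy sequences (made up for illustration, not real proteins).
toy_sequences = ["AAAAAA", "AAAAAC", "WWWWWW", "WWWWWY"]
toy_vectors = [ngram_frequencies(s, 1) for s in toy_sequences]
toy_clusters = agglomerate(toy_vectors, threshold=0.5)
print(toy_clusters)
```

On the toy input, the two alanine-rich strings end up in one cluster and the two tryptophan-rich strings in another. The vector length grows as 20^n, which is one reason a feature selection step, as the abstract describes, becomes relevant at n = 3. The pairwise loop here is quadratic per merge and would not scale to 20,105 proteins without a more efficient clustering routine.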

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Cell Membrane / metabolism
  • Cluster Analysis*
  • Computational Biology / methods*
  • Computer Simulation
  • Databases, Protein
  • Humans
  • Machine Learning
  • Phylogeny
  • Protein Conformation
  • Proteins / chemistry*
  • Sequence Alignment
  • Transferases / chemistry*

Substances

  • Proteins
  • Transferases