A tool for feature extraction from biological sequences

Brief Bioinform. 2022 May 13;23(3):bbac108. doi: 10.1093/bib/bbac108.

Abstract

With advances in sequencing technologies, a huge amount of biological data is now being produced. Analyzing data at this scale is beyond human ability, which creates an opportunity for machine learning methods. These methods, however, are practical only when the sequences are converted into feature vectors. Many tools target this task, including iLearnPlus, a Python-based tool that supports a rich set of features. In this paper, we propose a holistic tool that extracts features from biological sequences (i.e. DNA, RNA and protein). These features are the inputs to machine learning models that predict properties, structures or functions of the input sequences. Our tool supports not only all features in iLearnPlus but also 30 additional features from the literature. Moreover, our tool is written in R, giving bioinformaticians an alternative for transforming sequences into feature vectors. We compared the conversion time of our tool with that of iLearnPlus and found that our tool transforms sequences much faster: a median of 2.8X faster for small nucleotide sequences and a median of 6.3X faster for large sequences. Finally, for amino acid sequences, our tool achieves a median speedup of 23.9X.
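To illustrate what converting a sequence into a feature vector involves, the following is a minimal sketch of one widely used sequence-derived feature, normalized k-mer composition. It is written in Python purely for illustration and is not the API of the tool described here or of iLearnPlus; the function name `kmer_composition` and its parameters are hypothetical.

```python
from itertools import product


def kmer_composition(sequence, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies as a fixed-length feature vector.

    Hypothetical helper for illustration only; the vector has
    len(alphabet) ** k entries, ordered lexicographically by k-mer.
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer so the vector length is fixed.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with ambiguous symbols such as N
            counts[window] += 1
            total += 1
    # Normalize counts to frequencies; an empty input yields all zeros.
    return [counts[km] / total if total else 0.0 for km in kmers]


features = kmer_composition("ACGTACGT", k=2)
print(len(features))  # 16 dinucleotide frequencies
```

Because every sequence maps to a vector of the same length regardless of its own length, vectors like this can be passed directly to a machine learning model. Passing a 20-letter amino acid alphabet instead of `"ACGT"` would give the analogous feature for proteins.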

Keywords: R package; bioinformatics; biological sequences; feature extraction; machine learning; sequence-based feature.

MeSH terms

  • DNA / genetics
  • Humans
  • Machine Learning*
  • Proteins* / chemistry
  • RNA / genetics
  • Sequence Analysis / methods

Substances

  • Proteins
  • RNA
  • DNA