iRSpot-ADPM: Identify recombination spots by incorporating the associated dinucleotide product model into Chou's pseudo components

J Theor Biol. 2018 Mar 14:441:1-8. doi: 10.1016/j.jtbi.2017.12.025. Epub 2018 Jan 2.

Abstract

Gene recombination is a key process to produce hereditary differences. Recombination spot identification plays an important role in revealing genome evolution and promoting DNA function study. However, traditional experiments are not good at identifying recombination spot with huge amounts of DNA sequences springed up by sequencing. At present, some machine learning methods have been proposed to speed up this identification process. However, the correlations between nucleotides pairs at different positions along DNA sequence is often ignored, which reflects the important sequence order information. For this purpose, this study proposes a novel feature extraction method, called iRSpot-ADPM, based on DNA property in a given DNA sequence. 85 features are selected from the original feature set according to the weights calculated by support vector machine. Five-fold cross validation tests on two widely used benchmark datasets indicate that the proposed method outperforms its existing counterparts on the individual specificity(Spec), Matthews correlation coefficient(MCC) value and overall accuracy(OA). The experimental results show that the proposed method is effective for accurate recombination spot identification. Moreover, it is anticipated that the proposed method could be extended to other biology sequence and be helpful in future research. The datasets and Matlab source codes can be download from the URL: http://stxy.neuq.edu.cn/info/1095/1157.htm.

Keywords: Associated information; Property matrix; Recombination spots; Support vector machines.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • DNA / genetics
  • Internet
  • Models, Genetic*
  • Nucleotides / genetics*
  • Recombination, Genetic*
  • Reproducibility of Results
  • Sequence Analysis, DNA / methods
  • Software
  • Support Vector Machine

Substances

  • Nucleotides
  • DNA