Improved predictions of transcription factor binding sites using physicochemical features of DNA

Nucleic Acids Res. 2012 Dec;40(22):e175. doi: 10.1093/nar/gks771. Epub 2012 Aug 25.

Abstract

Typical approaches for predicting transcription factor binding sites (TFBSs) use a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to perform better than PWM-based methods in general. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. The new features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. The algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also consider SVM classification based on letter features alone, to distinguish performance gains due to SVM-based models from gains due to physicochemical features. The accuracy of each variant method was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
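To make the two sequence-only baselines mentioned in the abstract concrete, the following is a minimal sketch of PWM scoring and of a "letter feature" encoding of the kind that could be passed to a classifier. It is not the authors' implementation: the function names (`build_pwm`, `score`, `scan`, `one_hot`), the pseudocount of 1, the uniform background, and the toy sites are all assumptions chosen for illustration, and the physicochemical features, feature selection, and SVM steps are not shown.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds PWM from equal-length, aligned binding sites.

    Hypothetical helper: the pseudocount and uniform background are
    illustrative defaults, not values taken from the paper.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(width):
        # Start every base at the pseudocount so unseen bases stay finite.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background[b])
                    for b in BASES})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds for a window as wide as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

def scan(pwm, sequence):
    """Score every window of the sequence; returns (offset, score) pairs."""
    w = len(pwm)
    return [(i, score(pwm, sequence[i:i + w]))
            for i in range(len(sequence) - w + 1)]

def one_hot(window):
    """Encode a window as letter features: four 0/1 values per base."""
    return [1.0 if base == b else 0.0 for base in window for b in BASES]

# Toy aligned sites, invented for this example only.
sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]
pwm = build_pwm(sites)
hits = scan(pwm, "GGTTGACAGG")
best_offset, best_score = max(hits, key=lambda h: h[1])
```

In this sketch a PWM treats positions independently and sums their scores, whereas the one-hot vector from `one_hot` leaves it to a downstream classifier (for example, an SVM with a nonlinear kernel) to weigh or combine positions.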

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, U.S. Gov't, Non-P.H.S.
  • Validation Study

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Binding Sites
  • Chromatin Immunoprecipitation
  • DNA / chemistry*
  • DNA / metabolism
  • Escherichia coli / genetics
  • Escherichia coli Proteins / metabolism
  • Factor For Inversion Stimulation Protein / metabolism
  • Leucine-Responsive Regulatory Protein / metabolism
  • Nucleotide Motifs
  • Support Vector Machine*
  • Transcription Factors / metabolism*

Substances

  • Escherichia coli Proteins
  • Factor For Inversion Stimulation Protein
  • Fis protein, E coli
  • Lrp protein, E coli
  • Transcription Factors
  • Leucine-Responsive Regulatory Protein
  • DNA