Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data

BMC Bioinformatics. 2013;14 Suppl 10(Suppl 10):S2. doi: 10.1186/1471-2105-14-S10-S2. Epub 2013 Aug 12.

Abstract

Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented a standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a large negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracy increased with cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1, the performance of the model was no better than chance.
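
To make the idea of a linear k-mer model concrete, the following is a minimal Python sketch, not the authors' implementation. It scores a sequence as a weighted sum of its k-mer counts and calls it bound when the score is positive. The abstract does not say how the weights are fitted, so the sketch substitutes a simple log-ratio of k-mer frequencies in bound versus unbound sequences; that estimator, the function names (kmer_counts, fit_weights, select_top, score), the pseudocount and the toy sequences are all assumptions made for illustration. The select_top helper only mimics k-mer reduction by keeping the largest weights; in the study the reduction is guided by cross-validation.

```python
from collections import Counter
from math import log


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fit_weights(bound, unbound, k, pseudocount=1.0):
    """Give each k-mer a weight from bound and unbound training sequences.

    Assumption: a log-ratio of smoothed k-mer frequencies stands in for
    the fitted linear model, whose fitting procedure the abstract omits.
    """
    pos, neg = Counter(), Counter()
    for s in bound:
        pos.update(kmer_counts(s, k))
    for s in unbound:
        neg.update(kmer_counts(s, k))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + pseudocount * len(vocab)
    neg_total = sum(neg.values()) + pseudocount * len(vocab)
    return {
        m: log((pos[m] + pseudocount) / pos_total)
        - log((neg[m] + pseudocount) / neg_total)
        for m in vocab
    }


def select_top(weights, n):
    """Keep the n k-mers with the largest absolute weight (ties by name)."""
    top = sorted(weights, key=lambda m: (-abs(weights[m]), m))[:n]
    return {m: weights[m] for m in top}


def score(seq, weights, k):
    """Linear model: sum of (k-mer weight * k-mer count) over the sequence."""
    return sum(weights.get(m, 0.0) * c for m, c in kmer_counts(seq, k).items())


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    bound = ["AGATAA", "TGATAG"]
    unbound = ["CCCCCC", "ACGTAC"]
    w = fit_weights(bound, unbound, k=3)
    reduced = select_top(w, n=4)
    for s in ["AGATAA", "CCCCCC"]:
        print(s, "bound" if score(s, reduced, 3) > 0 else "unbound")
```

In practice the number of retained k-mers (n above) would be chosen by comparing held-out prediction accuracy across cross-validation folds rather than fixed by hand.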

Publication types

  • Randomized Controlled Trial
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • DNA / genetics
  • DNA / metabolism
  • DNA-Binding Proteins / genetics
  • DNA-Binding Proteins / metabolism
  • GATA1 Transcription Factor / genetics
  • GATA1 Transcription Factor / metabolism
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Linear Models*
  • NFATC Transcription Factors / genetics
  • NFATC Transcription Factors / metabolism
  • Oligonucleotide Array Sequence Analysis
  • Protein Binding / genetics
  • Protein Interaction Mapping / methods*
  • Proteins / genetics*
  • Proteins / metabolism
  • Regulatory Factor X Transcription Factors
  • Transcription Factors / genetics*
  • Transcription Factors / metabolism

Substances

  • DNA-Binding Proteins
  • GATA1 Transcription Factor
  • GATA1 protein, human
  • NFATC Transcription Factors
  • NFATC1 protein, human
  • Proteins
  • RFX3 protein, human
  • Regulatory Factor X Transcription Factors
  • Transcription Factors
  • DNA