Bioinformatics. 2012 Jun 15;28(12):i84-9.
doi: 10.1093/bioinformatics/bts202.

Recognition models to predict DNA-binding specificities of homeodomain proteins

Ryan G Christensen et al. Bioinformatics. 2012.

Abstract

Motivation: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors, the best available methods use k-nearest neighbor (KNN) algorithms, which predict specificity as the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes; an effective recognition model for this family would therefore facilitate predictive models of many transcriptional regulatory networks within these genomes.
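The KNN baseline described above can be illustrated with a minimal sketch. It assumes that similarity is the fraction of identical residues between pre-aligned sequences and that a specificity is stored as a position frequency matrix (PFM) with columns ordered A, C, G, T; the abstract specifies neither, and the names `identity` and `knn_predict` are hypothetical.

```python
from typing import List, Sequence, Tuple

PFM = List[List[float]]  # one row per motif position; columns are A, C, G, T


def identity(a: str, b: str) -> float:
    """Fraction of matching residues between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def knn_predict(query: str, training: Sequence[Tuple[str, PFM]], k: int = 3) -> PFM:
    """Average the PFMs of the k training proteins most similar to the query.

    Assumption: all training PFMs have the same number of rows.
    """
    ranked = sorted(training, key=lambda item: identity(query, item[0]), reverse=True)
    neighbors = [pfm for _, pfm in ranked[:k]]
    n_pos = len(neighbors[0])
    return [
        [sum(p[i][j] for p in neighbors) / len(neighbors) for j in range(4)]
        for i in range(n_pos)
    ]


# Toy data, invented for illustration only.
toy_training = [
    ("RKA", [[1.0, 0.0, 0.0, 0.0]]),
    ("RKT", [[0.0, 1.0, 0.0, 0.0]]),
    ("GGG", [[0.0, 0.0, 0.0, 1.0]]),
]
toy_prediction = knn_predict("RKA", toy_training, k=2)
```

With k=2, the query "RKA" is averaged over its two closest neighbors, so the third, dissimilar protein does not contribute to the predicted PFM.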

Results: Using extensive experimental data, we have tested several machine learning approaches and find that both support vector machines (SVMs) and random forests (RFs) can produce recognition models for HD proteins that are significant improvements over KNN-based methods. Cross-validation analyses show that the resulting models are capable of predicting specificities with high accuracy. We have produced a web-based prediction tool, PreMoTF (Predicted Motifs for Transcription Factors) (http://stormo.wustl.edu/PreMoTF), for predicting position frequency matrices from protein sequence using an RF-based model.
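As a rough illustration of the cross-validation analysis, the sketch below computes a k-fold mean squared error (MSE) between observed and predicted PFMs for any pluggable predictor. It assumes the MSE is averaged over all cells of a PFM and that folds are assigned by index; the paper's exact definitions are not given in the abstract, and `pfm_mse`, `cross_validated_mse` and `uniform_predictor` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

PFM = List[List[float]]      # one row per motif position; columns are A, C, G, T
Example = Tuple[str, PFM]    # (protein sequence, observed PFM)


def pfm_mse(observed: PFM, predicted: PFM) -> float:
    """Mean squared difference over all cells of two same-shaped PFMs."""
    squared = [
        (o - p) ** 2
        for row_o, row_p in zip(observed, predicted)
        for o, p in zip(row_o, row_p)
    ]
    return sum(squared) / len(squared)


def cross_validated_mse(
    data: Sequence[Example],
    fit_predict: Callable[[Sequence[Example], str], PFM],
    folds: int = 10,
) -> float:
    """Hold out each fold in turn, predict its PFMs, and average the MSE."""
    errors = []
    for f in range(folds):
        test = [ex for i, ex in enumerate(data) if i % folds == f]
        train = [ex for i, ex in enumerate(data) if i % folds != f]
        if not test or not train:
            continue  # skip folds that are empty for small data sets
        for seq, observed in test:
            errors.append(pfm_mse(observed, fit_predict(train, seq)))
    return sum(errors) / len(errors)


def uniform_predictor(train: Sequence[Example], seq: str) -> PFM:
    """Placeholder model: always predicts a uniform single-position PFM."""
    return [[0.25, 0.25, 0.25, 0.25]]


# Toy data, invented for illustration only.
toy_data = [(name, [[1.0, 0.0, 0.0, 0.0]]) for name in ("AAA", "AAC", "AAG", "AAT")]
toy_mse = cross_validated_mse(toy_data, uniform_predictor, folds=2)
```

Any model (KNN, SVM or RF) could be passed as `fit_predict`, provided it takes the training examples and a query sequence and returns a PFM, so competing models are scored on identical folds.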

Figures

Fig. 1.
Sequence logo of the MAFFT-generated HD multiple sequence alignment used for training the recognition models. The circles denote positions identified by our feature selection method (positions 3, 6, 19, 47, 50, 54 and 55). Most HD proteins contain Asn51 (‘X’ symbol), a critical recognition residue that binds adenine with high specificity. HDs lacking Asn51, such as Lag1, tend to have very divergent recognition motifs.
Fig. 2.
Average PFM for the trimmed HD multiple motif alignment
Fig. 3.
Heat map showing the protein alignment (horizontal axis) versus motif alignment (vertical axis) MIp matrix
Fig. 4.
Plot of the number of features used to train the KNN, RF and SVM models versus the 10-fold cross-validation MSE values. After seven features were included, the MSE stopped decreasing for KNN and SVM and did not decrease much for RF. Only the top 30 features were considered.
Fig. 5.
Comparison of logos for actual and predicted motifs. The predicted motifs are from the 10-fold cross-validation analysis of the RF model using positions 3, 6, 19, 47, 50, 54 and 55. The name above each observed motif is the HD used for prediction, and the MSE between the observed and predicted PFMs is given above each predicted motif.
