Genetics. 2011 Apr;187(4):1219-24.
doi: 10.1534/genetics.110.126052. Epub 2011 Feb 7.

Maximally efficient modeling of DNA sequence motifs at all levels of complexity

Gary D Stormo. Genetics. 2011 Apr.

Erratum in

  • Genetics. 2011 Dec;189(4):1525

Abstract

Identification of transcription factor binding sites is necessary for deciphering gene regulatory networks. Several new methods provide extensive data about the specificity of transcription factors, but most methods for analyzing these data to obtain specificity models are either limited in scope (for example, by assuming additive interactions) or inefficient in their exploration of more complex models. This article describes an approach, the encoding of DNA sequences as the vertices of a regular simplex, that allows simultaneous direct comparison of simple and complex models, with higher-order parameters fit to the residuals of lower-order models. In addition to providing an efficient assessment of all model parameters, this approach can yield valuable insight into the mechanism of binding by highlighting features that are critical to accurate models.

Figures

Figure 1.
Each of these PWMs would assign the same score to every sequence of length 3. (A) The parameters are all within the matrix, but the matrix is not unique; adding a constant to any column and subtracting that same constant from another column would give the same scores. (B) The T row is set to 0, and the external parameter +3 is added, which is the score for the sequence TTT. This matrix is unique given the constraint of 0's in the T row. (C) The preferred (lowest scoring) base in each column is set to 0, and the external parameter is −6, which is the score of that preferred sequence. This matrix is unique given that constraint and is the matrix obtained by the method of Berg and von Hippel (1987).
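The equivalence described in this caption can be checked numerically. The following is a minimal sketch, not the author's code: the matrix values are hypothetical (they are not the entries of Figure 1, only chosen so that the two offsets come out as the +3 and −6 quoted in the caption), and the helper names `score`, `shift_columns`, and `zero_reference` are illustrative.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical 3-position PWM, one dict per column (position).
# These are NOT the values of Figure 1; they are toy numbers.
pwm_a = [
    {"A": -2, "C": 0, "G": 2, "T": 1},
    {"A": 1, "C": -1, "G": 3, "T": 2},
    {"A": 2, "C": 1, "G": -3, "T": 0},
]

def score(pwm, seq, offset=0):
    """Additive score: external offset plus one matrix entry per position."""
    return offset + sum(col[base] for col, base in zip(pwm, seq))

def shift_columns(pwm, i, j, c):
    """Form (A) non-uniqueness: add c to column i and subtract c from column j."""
    out = [dict(col) for col in pwm]
    for b in BASES:
        out[i][b] += c
        out[j][b] -= c
    return out

def zero_reference(pwm, pick):
    """Set one reference base per column to 0 and collect the removed
    values into a single external offset. pick(col) names the reference base."""
    refs = [pick(col) for col in pwm]
    offset = sum(col[r] for col, r in zip(pwm, refs))
    matrix = [{b: col[b] - col[r] for b in BASES} for col, r in zip(pwm, refs)]
    return matrix, offset

all_seqs = ["".join(p) for p in product(BASES, repeat=3)]

# A shifted copy of form (A), then forms (B) and (C) derived from the same matrix.
shifted = shift_columns(pwm_a, 0, 2, 5)
t_zero, t_offset = zero_reference(pwm_a, lambda col: "T")
best_zero, best_offset = zero_reference(pwm_a, lambda col: min(BASES, key=col.get))
```

With these toy values, all four representations give identical scores for all 64 sequences; `t_offset` is the score of TTT and `best_offset` is the score of the lowest-scoring sequence.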
Figure 2.
The tetrahedral encoding of the bases, with the origin at 0 (central dot) and each vertex of the enclosing cube at position 1 or −1 in each dimension. The coordinates (dashed arrows) are labeled by the degenerate nucleotide code: W = (A or T), Y = (C or T), and K = (G or T). The coordinates for each base, in WYK space, are as follows: A = (1, −1, −1); C = (−1, 1, −1); G = (−1, −1, 1); T = (1, 1, 1).
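A short check that these four points form a regular simplex (a tetrahedron) centered on the origin is sketched below. The coordinates are taken from the caption; the helper names `sq_dist` and `dot` are illustrative.

```python
from itertools import combinations

# WYK coordinates as listed in the Figure 2 caption.
WYK = {
    "A": (1, -1, -1),
    "C": (-1, 1, -1),
    "G": (-1, -1, 1),
    "T": (1, 1, 1),
}

def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

pairs = list(combinations("ACGT", 2))

# Regular simplex: every pair of vertices is the same distance apart.
pair_sq_dists = {a + b: sq_dist(WYK[a], WYK[b]) for a, b in pairs}

# Centered on the origin: the coordinates sum to zero along each axis.
centroid_sum = tuple(sum(axis) for axis in zip(*WYK.values()))

# Any two distinct bases have the same dot product, so no base is "closer" to another.
pair_dots = {a + b: dot(WYK[a], WYK[b]) for a, b in pairs}

# Each axis is +1 exactly for the two bases in its degenerate code.
positive_on_axis = {
    name: {b for b, xyz in WYK.items() if xyz[i] == 1}
    for i, name in enumerate("WYK")
}
```

All six squared distances are equal and all six dot products are equal, which is what makes the encoding treat the four bases symmetrically.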
Figure 3.
Hadamard matrices. (A) Hadamard matrix for n = 1. (B) Rule for constructing Hadamard matrices for any power of 2, given H1. (C) H4 obtained by this method. This form is “normalized” with the top row and left column as all 1's.
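The caption does not spell out the construction rule. The sketch below assumes it is the standard Sylvester doubling rule, in which each matrix is built from four copies of the previous one, the bottom-right copy negated. The function names are illustrative.

```python
def sylvester(n):
    """Return the n x n Hadamard matrix from Sylvester's doubling rule.

    Assumes n is a power of 2; starts from the 1 x 1 matrix [[1]].
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] written row by row.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def gram(h):
    """Compute H times its transpose as a list of lists."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in h] for r1 in h]

H4 = sylvester(4)
H16 = sylvester(16)
```

Under this rule the top row and left column are all 1's, matching the "normalized" form in the caption, and the rows are mutually orthogonal, so `gram(H16)` is 16 times the identity matrix.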
Figure 4.
The Hadamard matrix H16 obtained as in Figure 3, except that the rows and columns have been rearranged to indicate the meaning of specific positions in the encoded sequences. The first column corresponds to the mean value of all sites and is deleted from the encoding to reduce the dimensionality to 15. The next three columns are the encoding of the first base of the dinucleotide, and the next three columns are for the second base of the dinucleotide, both based on the WYK encoding of Figure 2, as described in the text. The last nine columns are obtained as the outer product of the two base encodings. The order of those nine parameters is (w1w2, w1y2, w1k2, y1w2, y1y2, y1k2, k1w2, k1y2, k1k2) (see File S1 for an example). The column vector on the right shows the equivalence of each specific dinucleotide for each encoded string.
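The column layout described in this caption can be reproduced directly from the WYK coordinates of Figure 2. The following is a sketch under that reading of the caption; `encode_dinucleotide` and its `with_mean` flag are illustrative names, not taken from the article.

```python
from itertools import product

# WYK coordinates from the Figure 2 caption.
WYK = {
    "A": (1, -1, -1),
    "C": (-1, 1, -1),
    "G": (-1, -1, 1),
    "T": (1, 1, 1),
}

def encode_dinucleotide(b1, b2, with_mean=False):
    """Encode a dinucleotide as 15 values of +1 or -1.

    Layout: 3 values for base 1, 3 values for base 2, then the 9 pairwise
    products in the order w1w2, w1y2, w1k2, y1w2, ..., k1k2.
    with_mean=True prepends the constant column that the caption says is dropped.
    """
    e1, e2 = WYK[b1], WYK[b2]
    pairwise = [x * y for x in e1 for y in e2]  # outer product, row by row
    vec = list(e1) + list(e2) + pairwise
    return [1] + vec if with_mean else vec

dinucs = ["".join(p) for p in product("ACGT", repeat=2)]

# 16 x 16 matrix including the constant column, one row per dinucleotide.
design = [encode_dinucleotide(b1, b2, with_mean=True) for b1, b2 in dinucs]

# Rows of a Hadamard matrix are mutually orthogonal with squared length 16.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in design] for r1 in design]
```

Every entry of `design` is +1 or −1 and `gram` is 16 times the identity, which is consistent with the caption's statement that the encoding is a Hadamard matrix of order 16 with rearranged rows and columns.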
Figure 5.
The “regression logo” (RegLogo) for the simulated data shown in File S1. The vertical axis is the energy parameter (with negative values, for the preferred bases, on top) for each mononucleotide in positions 1, 2, and 3. Between them are the energy values for the adjacent dinucleotides at positions 1,2 and 2,3 on the same scale. The dinucleotide energies are the residual values not captured by the mononucleotide energies, so the energy for any specific sequence of length 3 is the sum of all the values for that sequence, including both the mononucleotide energies and the dinucleotide energies. The horizontal axis shows the variance explained by each base position and each dinucleotide. The total variance explained by a standard weight matrix is 0.86; each of the two dinucleotides contributes a further 0.07, which together account for all of the variance in the data.
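The idea of fitting dinucleotide parameters to the residuals of the mononucleotide model, and of splitting the variance between the two, can be illustrated on a single dinucleotide. This is a simplified sketch, not the analysis behind Figure 5: the energies are hypothetical toy values rather than the File S1 data, only one dinucleotide is modeled instead of three positions, and the coefficients are obtained by projecting onto the orthogonal ±1 columns, which is assumed here to be equivalent to an ordinary least-squares fit.

```python
from fractions import Fraction
from itertools import product

WYK = {"A": (1, -1, -1), "C": (-1, 1, -1), "G": (-1, -1, 1), "T": (1, 1, 1)}

def encode(b1, b2):
    """Constant column, 3 + 3 single-base columns, then 9 pairwise products."""
    e1, e2 = WYK[b1], WYK[b2]
    return [1, *e1, *e2, *(x * y for x in e1 for y in e2)]

# Hypothetical energies: additive per-base terms plus one non-additive CG term.
MONO1 = {"A": 0, "C": -2, "G": 1, "T": 1}
MONO2 = {"A": 1, "C": 0, "G": -3, "T": 2}

def toy_energy(b1, b2):
    return MONO1[b1] + MONO2[b2] + (-2 if (b1, b2) == ("C", "G") else 0)

dinucs = list(product("ACGT", repeat=2))
X = [encode(b1, b2) for b1, b2 in dinucs]
y = [Fraction(toy_energy(b1, b2)) for b1, b2 in dinucs]

# Orthogonal +/-1 columns: each coefficient is a scaled dot product with y.
coef = [sum(row[j] * yi for row, yi in zip(X, y)) / 16 for j in range(16)]

mono_cols, di_cols = range(0, 7), range(7, 16)

def predict(row, cols):
    return sum(coef[j] * row[j] for j in cols)

# Lower-order model first, then the part it fails to capture.
mono_fit = [predict(row, mono_cols) for row in X]
residual = [yi - fi for yi, fi in zip(y, mono_fit)]

# Fitting the dinucleotide columns to the residual gives the same coefficients
# as fitting them to y directly, because the columns are orthogonal.
di_coef_from_residual = [
    sum(row[j] * r for row, r in zip(X, residual)) / 16 for j in di_cols
]
di_fit = [predict(row, di_cols) for row in X]

# Variance partition: squared coefficients add up to the total variance.
total_var = sum((yi - coef[0]) ** 2 for yi in y) / 16
var_mono = sum(coef[j] ** 2 for j in range(1, 7))
var_di = sum(coef[j] ** 2 for j in di_cols)
frac_mono, frac_di = var_mono / total_var, var_di / total_var
```

In this toy setting the energy of each dinucleotide equals `mono_fit` plus `di_fit`, and `frac_mono + frac_di` equals 1, mirroring the way the caption adds the mononucleotide and dinucleotide contributions.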

References

    1. Benos, P. V., M. L. Bulyk and G. D. Stormo, 2002a. Additivity in protein-DNA interactions: How good an approximation is it? Nucleic Acids Res. 30: 4442–4451.
    2. Benos, P. V., A. S. Lapedes and G. D. Stormo, 2002b. Is there a code for protein-DNA recognition? Probab(ilistical)ly. BioEssays 24: 466–475.
    3. Berg, O. G., and P. H. von Hippel, 1987. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 193: 723–750.
    4. Djordjevic, M., A. M. Sengupta and B. I. Shraiman, 2003. A biophysical approach to transcription factor binding site discovery. Genome Res. 13: 2381–2390.
    5. Foat, B. C., A. V. Morozov and H. J. Bussemaker, 2006. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22: e141–e149.