PLoS Biol. 2010 Aug 17;8(8):e1000456.
doi: 10.1371/journal.pbio.1000456.

Quantitative analysis of the Drosophila segmentation regulatory network using pattern generating potentials

Majid Kazemian et al. PLoS Biol. 2010.

Erratum in

  • PLoS Biol. 2013 Oct;11(10). doi: 10.1371/annotation/e38f4ae8-0776-42e6-b912-50800f54436e

Abstract

Cis-regulatory modules that drive precise spatial-temporal patterns of gene expression are central to the process of metazoan development. We describe a new computational strategy to annotate genomic sequences based on their "pattern generating potential" and to produce quantitative descriptions of transcriptional regulatory networks at the level of individual protein-module interactions. We use this approach to convert the qualitative understanding of interactions that regulate Drosophila segmentation into a network model in which a confidence value is associated with each transcription factor-module interaction. Sequence information from multiple Drosophila species is integrated with transcription factor binding specificities to determine conserved binding site frequencies across the genome. These binding site profiles are combined with transcription factor expression information to create a model to predict module activity patterns. This model is used to scan genomic sequences for the potential to generate all or part of the expression pattern of a nearby gene, obtained from available gene expression databases. Interactions between individual transcription factors and modules are inferred by a statistical method to quantify a factor's contribution to the module's pattern generating potential. We use these pattern generating potentials to systematically describe the location and function of known and novel cis-regulatory modules in the segmentation network, identifying many examples of modules predicted to have overlapping expression activities. Surprisingly, conserved transcription factor binding site frequencies were as effective as experimental measurements of occupancy in predicting module expression patterns or factor-module interactions. Thus, unlike previous module prediction methods, this method predicts not only the location of modules but also their spatial activity pattern and the factors that directly determine this pattern. 
As databases of transcription factor specificities and in vivo gene expression patterns grow, analysis of pattern generating potentials provides a general method to decode transcriptional regulatory sequences and networks.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Logistic regression model and its performance on training data.
(A) Components of the logistic regression model. For each transcription factor, its differential occupancy across the genome is described as a profile (“Factor Motif Score”) based on multi-species comparison of genomic regions using its DNA binding motif (“Factor Motif”). The contribution of the factor to the CRM's expression at that position (“Weighted Occupancy”) is described by the product of the factor's motif score in the given CRM (odd_3 in this example), its concentration (“Factor Concentration”) at a specific position along the A/P axis, and a weight assigned to the factor (“Factor Weights”). Contributions from all factors are added and transformed by a logistic function to predict the CRM's expression (“Predicted Expression,” dark blue). Factor weights are learned using a training set of CRMs with known A/P activity patterns. (B) Known (red) and predicted (dark blue) expression patterns, along the A/P axis, of 46 experimentally characterized CRMs. Heights of dark blue trace are proportional to the predicted expression level. Predictions deemed as being “good” (count = 20), “fair” (15), or “bad” (11) matches to known patterns (based on visual inspection) are indicated with green, blue, and red labels, respectively. In some cases, labels use abbreviated versions of CRM names.
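The model in panel (A) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `predicted_expression`, the `bias` intercept, the sign convention (positive weight for an activator, negative for a repressor), and all numeric values are hypothetical. Only the structure (weight × motif score × concentration, summed over factors and passed through a logistic function) comes from the caption.

```python
import math

def predicted_expression(motif_scores, concentrations, weights, bias=0.0):
    """Predict a CRM's expression at each position along the A/P axis.

    motif_scores:   {factor: motif score of the factor in this CRM}
    concentrations: {factor: [concentration at each A/P position]}
    weights:        {factor: learned weight}; the sign convention
                    (positive = activator) is an assumption of this sketch
    bias:           hypothetical intercept, not mentioned in the caption
    """
    n_positions = len(next(iter(concentrations.values())))
    profile = []
    for pos in range(n_positions):
        # Sum the "weighted occupancy" of every factor at this position.
        total = bias
        for factor, score in motif_scores.items():
            total += weights[factor] * score * concentrations[factor][pos]
        # The logistic function maps the sum to an expression level in (0, 1).
        profile.append(1.0 / (1.0 + math.exp(-total)))
    return profile

# Toy inputs (made up): one activator and one repressor, three A/P positions.
motif_scores = {"CAD": 2.0, "HB": 1.5}
concentrations = {"CAD": [0.0, 0.5, 1.0], "HB": [1.0, 0.5, 0.0]}
weights = {"CAD": 2.0, "HB": -2.0}
profile = predicted_expression(motif_scores, concentrations, weights)
```

With these toy inputs the predicted expression rises from the position where only the repressor is present to the position where only the activator is present.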
Figure 2. The role of transcription factor Capicua in A/P patterning.
(A) The hypothetical activator TorRE and known repressor Capicua (CIC) have highly similar binding specificity (p value = 0.0012, Note 9 in Text S1), and (B) their expression profiles are perfectly complementary. (C) The regression model assigns highly significant weights to either motif and (D) the overall quality of fit is comparable between models that use one versus another. (E) Motif scores of TorRE and CIC are strongly correlated among the 46 A/P CRMs. (F) Average motif scores of TorRE (red) and CIC (blue) along the A/P axis (based on CRMs expressed at each position) are correlated, with high values at terminals. (G) Predicted regulatory network showing direct and indirect targets of CIC. Edges reflect a regulatory influence of CIC or its target TFs on any of the 35 CRMs included in the analysis, at an empirical p value threshold of 0.05. Directionality of influence is shown by arrow for activators (FKH) and flat line for repressors (CIC, HKB, KNI, TLL). Gray edges point to direct targets of CIC.
Figure 3. Pattern generating potential score.
(A) Schematic for CRM discovery method. A genomic region (gene transcript, plus 10 Kbp upstream and 10 Kbp downstream) is scanned with a 1 Kbp window (filled rectangles). For each window, the predicted expression profile (open blue and green rectangles) is compared to the endogenous expression profile of the gene (open red rectangle, in center) to obtain the pattern generating potential (PGP) score, which is plotted (bottom panel) as a function of the genomic coordinate of the window. (B) Design features of the PGP score that distinguish it from the correlation coefficient (CC) or the root mean square error (RMSE). For each desired feature (“Characteristic”), two scenarios of comparison between known (red) and predicted (dark blue) expression profiles (“Expression”), along with PGP, CC, and 1-RMSE values, are shown. A perfect match would correspond to a value of 1 for each score. Cases where the value of a score in the two scenarios captures the desired feature are shaded in green. (C) Computation of the PGP score. (i) The predicted expression pattern (green) is shown along with the known domain of expression (red). (ii) The average predicted expression is calculated separately for domains of expression (the “reward” term) and of non-expression (the “penalty” term) and (iii) combined into the PGP score, by subtracting the penalty term from the reward term. The penalty term is assigned thrice as much weight as the reward term. The difference of reward and penalty thus computed is scaled linearly in the final step (“y = 0.5+0.5x”), giving us the PGP score. This scaling is merely a notational convenience (making the range of PGP scores fall between −1 and 1) and irrelevant to the way PGP scores are used in our pipeline. (D) Assessment of the PGP method and previous, binding site clustering-based methods for CRM prediction in the A/P-22 set. The number of known CRMs recovered (y-axis) in the top k predicted CRMs is shown, as a function of k (x-axis). 
The programs are: Cluster Buster (CBust) and its multi-species version (MS_CBust, our implementation; see Note 8 in Text S1), STUBB and its multi-species version (MS_STUBB, see Note 8 in Text S1), and PGP, evaluated within a leave-one-out cross-validation setting (PGP_CV). (E) PGP score distribution for CRMs predicted in the gene sets “A/P-22” (62 CRMs), “FlyExpress” (123 CRMs), as well as a “False Positive” set. The latter consists of eight experimentally tested sequences that contain a cluster of binding sites for A/P factors but do not drive any detectable expression in the embryo (Note 10 in Text S1). Medians, quartiles, and ranges are shown.
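One literal reading of the PGP computation in panel (C) is sketched below. It assumes a binary known expression domain, takes the reward and penalty as plain averages of the predicted expression inside and outside that domain, and applies the threefold penalty weight directly in the subtraction. The function name `pgp_score` and the toy profiles are hypothetical, and the published method may define the reward and penalty terms in more detail than the caption gives.

```python
def pgp_score(predicted, expressed, penalty_weight=3.0):
    """Pattern generating potential, following the caption's description.

    predicted: predicted expression levels in [0, 1], one per A/P position
    expressed: booleans, True where the gene is known to be expressed
    penalty_weight: the caption says the penalty gets thrice the reward's weight
    """
    inside = [p for p, e in zip(predicted, expressed) if e]
    outside = [p for p, e in zip(predicted, expressed) if not e]
    # Average predicted expression within the known domain ("reward") ...
    reward = sum(inside) / len(inside) if inside else 0.0
    # ... and outside it ("penalty").
    penalty = sum(outside) / len(outside) if outside else 0.0
    x = reward - penalty_weight * penalty
    # Final linear scaling from the caption: y = 0.5 + 0.5x.
    return 0.5 + 0.5 * x

domain = [False, True, True, False]
perfect = pgp_score([0.0, 1.0, 1.0, 0.0], domain)   # matches the domain exactly
inverted = pgp_score([1.0, 0.0, 0.0, 1.0], domain)  # expressed only outside it
uniform = pgp_score([1.0, 1.0, 1.0, 1.0], domain)   # expressed everywhere
```

Under this reading the unscaled difference lies between −3 and 1, so the scaled score lies between −1 and 1, which agrees with the range stated in the caption.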
Figure 4. Expression patterns of predicted CRMs compared to known gene expression patterns.
(A) Several genes in the A/P-22 set have two or more related CRMs (either predicted or known) that drive similar expression patterns. For each gene, the endogenous expression domain is shown (red), along with predicted expression profiles of CRMs (blue). Labels in bold indicate known CRMs. Predicted expression pattern is shown with color intensity proportional to expression value. (B) Gene expression pattern (red, top) is shown along with predicted expression pattern (dark blue, bottom) of 60 CRMs predicted in the FlyExpress set.
Figure 5. Experimental validation of predicted CRMs.
(A) Predicted expression profiles are shown for genomic segments near four genes with A/P patterning from the “FlyExpress” set (noc, SoxN, Antp, and Ubx). The predicted expression is shown as a blue curve and the binarized blastoderm expression of the endogenous gene is shown as thick red lines. Additional reporters from three genes (pdm2, emc, and apt) were not active in early embryos (unpublished data). (B) The cis-regulatory activity of each region was tested in a transgene reporter construct. Spatial activity was determined by RNA in situ using a probe to a Gal4 reporter gene. Expression of the Ubx_1 reporter begins slightly after the blastoderm stage, resembling the expression of the endogenous gene.
Figure 6. A gene regulatory network for A/P patterning.
(A) Inference of TF–CRM interaction. For each motif, a histogram (blue) of RMSE scores (between real and predicted expression) is obtained from random permutations of the TF concentration profile, leading to a p value of the observed RMSE score (black dot on x-axis). Top right panel shows the true (red) and predicted (blue) expression profiles. Also shown is the effect of in silico “knock down” of each TF (panels on right, red border), and the corresponding RMSE score (red dot on x-axis of histograms). The expression profiles of the CAD activator and the HB and TLL repressors are shown in Figure 1A. (B) Predicted regulatory network for 10 TFs and 35 experimentally characterized CRMs. Edges reflect a regulatory influence from TF to CRM, at an empirical p value threshold of 0.05. Directionality of influence is shown by arrow for activators and flat line for repressors.
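The permutation test in panel (A) can be illustrated as follows. This is a sketch under assumptions rather than the authors' procedure: `toy_predict` is a hypothetical one-activator stand-in for the full model, the number of permutations and the add-one correction in the empirical p value are choices made here, and all profiles are invented. The in silico "knock down" is modeled by setting the TF concentration to zero everywhere.

```python
import math
import random

def rmse(known, predicted):
    """Root mean square error between two expression profiles."""
    return math.sqrt(sum((k - p) ** 2 for k, p in zip(known, predicted)) / len(known))

def toy_predict(tf_profile, weight=6.0, bias=-3.0):
    # Hypothetical single-activator logistic model standing in for the full model.
    return [1.0 / (1.0 + math.exp(-(weight * c + bias))) for c in tf_profile]

def permutation_p_value(known, tf_profile, predict, n_perm=200, seed=0):
    """Empirical p value of the observed RMSE against shuffled TF profiles."""
    rng = random.Random(seed)
    observed = rmse(known, predict(tf_profile))
    hits = 0
    for _ in range(n_perm):
        shuffled = list(tf_profile)
        rng.shuffle(shuffled)  # randomly permute the TF concentration profile
        if rmse(known, predict(shuffled)) <= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: the TF is present exactly where the CRM is expressed.
tf_profile = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
known = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
observed, p_value = permutation_p_value(known, tf_profile, toy_predict)

# In silico knock down: remove the TF and see how much the fit degrades.
knockdown_rmse = rmse(known, toy_predict([0.0] * len(tf_profile)))
```

A small p value means that the fit depends on where the TF is actually expressed, which is the kind of evidence the caption describes for drawing a TF-CRM edge at the 0.05 threshold.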
Figure 7. Examples of how maternal and gap patterned TFs together give rise to patterned expression.
Shown are nine sample CRMs, their expression domains (in pink) along the A/P axis (left: anterior), their regulators (as per the predicted regulatory network of Figure 6B), and their respective expression domains (in color code matching that of Figure 6B). Arrows indicate activation and barred lines indicate repressive influence. Repressor domains shown are required to be overlapping with an activator's domain of influence. Solid edges indicate that the regulatory influence is supported by previous experimental evidence in the literature, while dashed edges indicate novel interactions. Labels of TF expression domains are in black or white for better color contrast and have no semantic difference. *The edge between DSTAT and eve_stripe5 is not based on our model predictions (since DSTAT is broadly expressed and was not included in the model) but on the presence of DSTAT binding sites (motif score greater than 4 standard deviations above genomic mean) in the CRM.
Figure 8. Incongruous occupancy by repressors.
(A) Four examples of incongruous occupancy, from ChIP data, by repressors (GT or KR) in known CRMs. Shaded areas above horizontal axis indicate domains of expression driven by the CRM (red = real, dark blue = predicted). Shaded areas below axis are regions where a repressor (GT in blue or KR in green) is present and will thus inhibit expression if it occupies the CRM. (B) Motif presence or absence in cases of incongruous occupancy indicated by ChIP. “Multi-species Motif” shows whether the multi-species motif score is strong (>2 standard deviations above genomic mean), weak (above genomic mean), or neither. “D.mel Motif” shows, for cases where the multi-species score indicates motif absence, whether motif score in D. melanogaster is above genomic mean or not.
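The motif strength categories in panel (B) amount to a threshold rule on the motif score relative to the genome-wide distribution. The sketch below assumes that a list of genome-wide scores is available and uses the population standard deviation. The function name `classify_motif` and the example scores are hypothetical.

```python
import statistics

def classify_motif(score, genomic_scores, strong_sd=2.0):
    """Label a motif score as 'strong', 'weak', or 'neither'.

    strong:  more than `strong_sd` standard deviations above the genomic mean
    weak:    above the genomic mean
    neither: at or below the genomic mean
    """
    mean = statistics.mean(genomic_scores)
    sd = statistics.pstdev(genomic_scores)  # population SD is an assumption
    if score > mean + strong_sd * sd:
        return "strong"
    if score > mean:
        return "weak"
    return "neither"

# Made-up genome-wide scores: mean 2, SD about 1.41, strong threshold about 4.83.
genomic_scores = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [classify_motif(s, genomic_scores) for s in (5.0, 3.0, 1.0)]
```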
