With the advent of high-throughput sequencing and high-resolution transcriptomic technologies, there is now an unprecedented opportunity to understand gene regulation at a quantitative level. State-of-the-art models of the relationship between regulatory sequence and gene expression have shown great promise, but also suffer from major shortcomings. In this paper, we identify and address methodological challenges pertaining to quantitative modeling of gene expression from sequence, and test our models on the anterior-posterior patterning system in the Drosophila embryo. We first develop a framework to process cellular-resolution three-dimensional gene expression data from the Drosophila embryo and create data sets on which quantitative models can be trained. Next, we propose a new score, called 'weighted pattern generating potential' (w-PGP), to evaluate model predictions, and show its advantages over the two most common scoring schemes in use today. The model-building exercise uses w-PGP as the evaluation score and adopts a systematic strategy to increase a model's complexity while guarding against over-fitting. Our model identifies three transcription factors (ZELDA, SLOPPY-PAIRED, and NUBBIN) that have not previously been incorporated in quantitative models of this system as having significant regulatory influence. Finally, we show that fitting quantitative models on data sets comprising only a handful of enhancers, as reported in earlier work, may lead to unreliable models.
Keywords: Cellular resolution data; Drosophila A/P patterning system; Enhancer; Quantitative model; Transcription factor; Transcriptional regulation.
Copyright © 2013. Published by Elsevier Inc.