Bioinformatics. 2011 Feb 15;27(4):509-15.
doi: 10.1093/bioinformatics/btq701. Epub 2010 Dec 24.

A Computationally Efficient Modular Optimal Discovery Procedure

Free PMC article

Sangsoon Woo et al. Bioinformatics.

Abstract

Motivation: It is well known that patterns of differential gene expression across biological conditions are often shared by many genes, particularly those within functional groups. Taking advantage of these patterns can lead to increased statistical power and biological clarity when testing for differential expression in a microarray experiment. The optimal discovery procedure (ODP), which maximizes the expected number of true positives for each fixed number of expected false positives, is a framework aimed at this goal. Storey et al. introduced an estimator of the ODP for identifying differentially expressed genes. However, the computation time of their ODP estimator grows quadratically with the number of genes. Reducing this computational burden is a key step in making the ODP practical for use in a variety of high-throughput problems.

Results: Here, we propose a new estimate of the ODP called the modular ODP (mODP). The existing 'full ODP' requires that the likelihood function for each gene be evaluated according to the parameter estimates for all genes. The mODP assigns genes to modules according to a Kullback-Leibler distance, and then evaluates the statistic only at the module-averaged parameter estimates. We show that the mODP is relatively insensitive to the choice of the number of modules, but dramatically reduces the computational complexity from quadratic to linear in the number of genes. We compare the full ODP algorithm and mODP on simulated data and gene expression data from a recent study of Moroccan Amazighs. The mODP and full ODP algorithm perform very similarly across a range of comparisons.
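
The contrast between the full ODP and the mODP can be sketched in a deliberately simplified setting. The Python below is an illustration under stated assumptions, not the EDGE implementation: each gene is summarized by the sample mean of n unit-variance normal observations, the null hypothesis is a mean of zero for every gene, and modules are formed by a one-dimensional nearest-centroid clustering (for unit-variance normals the Kullback-Leibler distance between two means a and b is (a-b)^2/2, so nearest-centroid assignment by squared distance matches assignment by that distance). The function names `full_odp`, `modular_odp` and `assign_modules` are hypothetical.

```python
import numpy as np

def log_lik(mu, xbar, n):
    """Log-likelihood in mu (up to a data-only constant) for the mean of
    n unit-variance normal observations with sample mean xbar."""
    return -0.5 * n * (xbar - mu) ** 2

def full_odp(xbar, n):
    """Toy full ODP statistic: each gene's likelihood is evaluated at the
    estimates of all m genes, so the cost is O(m^2)."""
    num = np.exp(log_lik(xbar[None, :], xbar[:, None], n)).sum(axis=1)
    den = np.exp(log_lik(0.0, xbar, n))  # assumed null: mean zero
    return num / den

def assign_modules(mu_hat, K, n_iter=50):
    """Group estimates into K modules by nearest centroid (assumption: this
    stands in for the Kullback-Leibler-based assignment in the paper)."""
    centers = np.quantile(mu_hat, (np.arange(K) + 0.5) / K)
    for _ in range(n_iter):
        labels = np.abs(mu_hat[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = mu_hat[labels == k].mean()
    sizes = np.bincount(labels, minlength=K)
    return centers, sizes

def modular_odp(xbar, n, K):
    """Toy mODP statistic: likelihoods are evaluated only at the K
    module-averaged estimates, weighted by module size, so the cost is O(m*K)."""
    centers, sizes = assign_modules(xbar, K)
    lik = np.exp(log_lik(centers[None, :], xbar[:, None], n))
    num = (sizes[None, :] * lik).sum(axis=1)
    den = np.exp(log_lik(0.0, xbar, n))
    return num / den

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    # Simulated sample means: 300 null genes and 200 genes with mean 1.5.
    xbar = np.concatenate([rng.normal(0.0, 1 / np.sqrt(n), 300),
                           rng.normal(1.5, 1 / np.sqrt(n), 200)])
    full = full_odp(xbar, n)
    mod = modular_odp(xbar, n, K=50)
    print(np.corrcoef(np.log(full), np.log(mod))[0, 1])
```

In this sketch the only difference between the two statistics is the set of parameter values at which the likelihood is evaluated: all m estimates versus K size-weighted module averages. With K fixed, the work per gene no longer depends on m.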

Availability: The mODP methodology has been implemented in EDGE, a comprehensive gene expression analysis software package in R, available at http://genomine.org/edge/.

Figures

Fig. 1.
A heatmap of simulated gene expression data for a study comparing three groups. The genes inside the black box show three common gene expression patterns; the first pattern is downregulated for groups 1 and 2 and upregulated for group 3. The second pattern is downregulated for groups 1 and 3 and upregulated for group 2. The third pattern is downregulated for group 2 and upregulated for groups 1 and 3. The number of genes sharing each pattern is different, and only three of the six possible differential expression patterns are present. The ODP is designed to utilize these expression patterns to improve inference of differential expression.
Fig. 2.
A demonstration of the difference between the ODP approach and the LR statistic. Suppose that hypothesis tests H0 : μ=0 versus H1 : μ≠0 are performed on μ1, μ2,…, μm based on respective datasets x1, x2,…, xm. Shown are the likelihood functions for test 5, L(μ|x5) in red, and test 13, L(μ|x13) in blue. Their maximum likelihood estimates are such that $\hat{\mu}_5 = -\hat{\mu}_{13}$, implying that they would produce equal LR statistics. The ODP utilizes information from all of the maximum likelihood estimates $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_m$, shown at the top of the plot. These tend to be more similar to $\hat{\mu}_{13}$ than to $\hat{\mu}_5$, lending greater evidence against the null hypothesis for test 13. The ODP quantifies this evidence by calculating the likelihood functions over all maximum likelihood estimates, shown as red dots for test 5 and as blue dots for test 13. It can be seen that $\sum_{i=1}^{m} L(\hat{\mu}_i \mid x_{13}) > \sum_{i=1}^{m} L(\hat{\mu}_i \mid x_5)$, implying that the ODP statistic for test 13 would be larger than that for test 5. This makes sense in that there are many more positive $\hat{\mu}_i$ than negative, so we should attribute stronger evidence against the null hypothesis to those tests with positive estimates. In more complex situations such as those encountered in gene expression studies, this aggregation of information becomes even more useful.
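
The situation in this caption can be mimicked with a small numerical sketch. The estimates below are made-up values chosen so that most are positive, each test is assumed to have a single unit-variance normal observation (so the observation is its own maximum likelihood estimate), and the names `lik`, `a` and `b` are hypothetical. Two tests with estimates of equal magnitude and opposite sign receive the same LR statistic, while the sum of likelihoods over all estimates favors the positive one.

```python
import numpy as np

def lik(mu, x):
    """Likelihood of mean mu (up to a constant) for one unit-variance
    normal observation x."""
    return np.exp(-0.5 * (x - mu) ** 2)

# Made-up maximum likelihood estimates, mostly positive (illustrative only).
mu_hat = np.array([-1.0, 0.8, 0.9, 1.0, 1.1, 1.2, 0.1, -0.1, 1.3, 0.7])
a, b = 0, 3  # test a has estimate -1.0, test b has estimate +1.0

# LR statistic: likelihood at the test's own estimate over likelihood at the null.
lr_a = lik(mu_hat[a], mu_hat[a]) / lik(0.0, mu_hat[a])
lr_b = lik(mu_hat[b], mu_hat[b]) / lik(0.0, mu_hat[b])

# ODP-style statistic: likelihood summed over all estimates, over the null likelihood.
odp_a = lik(mu_hat, mu_hat[a]).sum() / lik(0.0, mu_hat[a])
odp_b = lik(mu_hat, mu_hat[b]).sum() / lik(0.0, mu_hat[b])

print(lr_a, lr_b)    # equal: the LR statistic ignores the other tests
print(odp_a, odp_b)  # the positive estimate gets the larger statistic
```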
Fig. 3.
A plot of relative CPU time for increasing numbers of genes for the mODP and the full ODP estimators under one of the simulation scenarios. The full ODP grows approximately quadratically in the number of genes while the mODP grows nearly linearly.
Fig. 4.
A comparison of the mODP and the full ODP method based on simulated data. Each panel shows the number of genes called significant at each q-value cutoff, averaged over 100 simulated datasets. Solid colored lines are the proposed mODP method for different numbers of modules K and the black dashed line is the full ODP. The simulations correspond to (A) two group comparison, fixed equal variances, (B) two group comparison, variances sampled from a Uniform distribution, (C) two group comparison, variances sampled from a Gamma distribution, (D) two group comparison, variances sampled from a Uniform mixture, (E) three group comparison, fixed equal variances, (F) three group comparison, variances sampled from a Uniform distribution, (G) three group comparison, variances sampled from a Gamma distribution and (H) three group comparison, variances sampled from a Uniform mixture.
Fig. 5.
A comparison of the mODP and the full ODP approaches on the Moroccan data from Idaghdour et al. (2008). In each plot, the number of significant genes is plotted versus the corresponding q-value cutoff. (A) Agadir versus Village, (B) Agadir versus Desert, (C) Desert versus Village and (D) Agadir versus Desert versus Village (three group comparison). The mODP performs nearly identically to the full ODP, particularly when K ≥ 50.
