BMC Bioinformatics, 9 Suppl 9 (Suppl 9), S10

The Balance of Reproducibility, Sensitivity, and Specificity of Lists of Differentially Expressed Genes in Microarray Studies

Leming Shi et al. BMC Bioinformatics.

Abstract

Background: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists.

Results: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact of a few widely used gene selection procedures on the reproducibility of DEG lists. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan, widely regarded as the "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff as a filter, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists based solely on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations.

Conclusion: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance for choosing the appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
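The recommended rule (filter by a non-stringent P cutoff, then rank by fold change) can be illustrated with a minimal sketch. The function name `select_degs`, the default cutoff of 0.05, and the toy values are assumptions made for illustration only; they are not taken from the MAQC analysis, and the P-values are assumed to have been computed beforehand (for example, by a simple t-test).

```python
def select_degs(log2_fc, p_values, n_genes, p_cutoff=0.05):
    """Return indices of up to n_genes genes selected by FC-ranking with a P filter.

    log2_fc  : per-gene log2 fold changes (sample B versus sample A)
    p_values : per-gene P-values, precomputed elsewhere
    p_cutoff : non-stringent filter; 0.05 is an illustrative assumption
    """
    if len(log2_fc) != len(p_values):
        raise ValueError("log2_fc and p_values must have the same length")

    # Step 1: keep only genes that pass the non-stringent P-value filter.
    passing = [i for i, p in enumerate(p_values) if p < p_cutoff]

    # Step 2: rank the surviving genes by absolute fold change, largest first.
    passing.sort(key=lambda i: abs(log2_fc[i]), reverse=True)

    # Step 3: report the requested number of top-ranked genes.
    return passing[:n_genes]


# Toy data: six hypothetical genes.
log2_fc = [3.1, -2.4, 0.2, 1.5, -0.1, 2.9]
p_values = [0.001, 0.03, 0.0001, 0.04, 0.5, 0.2]

top = select_degs(log2_fc, p_values, n_genes=3)
print(top)
```

In this toy example the gene at index 2 has the smallest P-value but a negligible fold change, so it is ranked last among the genes that pass the filter, while the gene at index 5 has a large fold change but is removed by the P filter.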

Figures

Figure 1
Concordance for inter-site comparisons. Each panel represents the POG results for a commercial platform of inter-site consistency in terms of DEGs between samples B and A. For each of the six gene selection methods, there are three possible inter-site comparisons: S1–S2, S1–S3, and S2–S3 (S = Site). Therefore, each panel consists of 18 POG lines that are colored based on gene ranking/selection method. Results shown here are based on the entire set of "12,091" genes commonly mapped across the microarray platforms without noise (absent call) filtering. POG results are improved when the analyses are performed using the subset of genes that are commonly detectable by the two test sites, as shown in Figure 2. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to the two gene lists derived from two test sites at a given number of DEGs.
Figure 2
Concordance for inter-site comparisons based on genes commonly detectable by the two test sites compared. Each panel represents the POG results for a commercial platform of inter-site consistency in terms of DEGs between samples B and A. For each of the six gene selection methods, there are three possible inter-site comparisons: S1–S2, S1–S3, and S2–S3. Therefore, each panel consists of 18 POG lines that are colored based on gene ranking/selection method. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to the two gene lists derived from two test sites at a given number of DEGs.
Figure 3
Concordance for inter-site comparison with samples C and D. The largest fold change between samples C and D is small (three-fold). For each platform, DEG lists from sites 1 and 2 are compared. Analyses are performed using the subset of genes that are commonly detectable by the two test sites.
Figure 4
Concordance for cross-platform comparisons. Panel a: Based on the data set of "12,091" genes (without noise filtering); Panel b: Based on subsets of genes commonly detected ("Present") by two platforms. For each platform, the data from test site 1 are used for cross-platform comparison. Each POG line corresponds to comparison of the DEGs from two microarray platforms using one of the six gene selection methods. There are ten platform-platform comparison pairs, resulting in 60 POG lines for each panel. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to the two gene lists derived from two platforms at a given number of DEGs. POG lines circled by the blue oval are from FC-based gene selection methods with or without a P cutoff, whereas POG lines circled by the teal oval are from P-based gene selection methods with or without an FC cutoff. Shown here are results for comparing sample B and sample A.
Figure 5
Concordance between microarray and TaqMan® assays. Each panel represents the comparison of one microarray platform to TaqMan® assays. For each microarray platform, the data from test site 1 are used for comparison to TaqMan® assays. Each POG line corresponds to comparison of the DEGs from one microarray platform and those from the TaqMan® assays using one of the six gene selection methods. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to DEGs derived from a microarray platform and those from TaqMan® assays. Shown here are results for comparing sample B and sample A using a subset of genes that are detectable by both the microarray platform and TaqMan® assays. Results based on the entire set of 906 genes are provided in Figure 6.
Figure 6
Concordance between microarray and TaqMan® assays without noise-filtering. Each panel represents the comparison of one microarray platform to TaqMan® assays. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to DEGs derived from a microarray platform and those from TaqMan® assays. Shown here are results for comparing sample B and sample A using the entire set of 906 genes for which TaqMan® assay data are available.
Figure 7
Inter-site reproducibility of log2 FC and log2 t-statistic. a: log2 FC of site 1 versus log2 FC of site 2; b: log2 t-statistic of test site 1 versus log2 t-statistic of test site 2; and c: log2 FC of test site 1 versus log2 t-statistic of test site 1. Shown here are results for comparing sample B and sample A for all "12,091" genes commonly probed by the five microarray platforms. The inter-site reproducibility of log2 FC (a) is much higher than that of log2 t-statistic (b). The relationship between log2 FC and log2 t-statistic from the same test site is non-linear and the correlation appears to be low (c).
Figure 8
Concordance between FC- and P-based gene ranking methods ("12,091 genes"; test site 1). Each POG line represents a platform using data from its first test site. The x-axis represents the number of selected DEGs, and the y-axis is the percentage (%) of genes common to the DEGs derived from FC- and P-ranking. Shown here are results for comparing sample B and sample A for all "12,091" genes commonly probed. When a smaller number of genes (up to a few hundred or a few thousand) is selected, POG for the cross-method comparison (FC vs. P) is low. For example, only about 50% of genes are in common among the top 500 genes selected by FC and by P separately, indicating that FC and P rank DEGs in dramatically different orders. As the number of selected DEGs increases, the overlap between the two methods increases and eventually approaches 100%, as expected. The low concordance between FC- and P-based gene ranking methods is not unexpected considering their different definitions and low correlation (Figure 7c).
Figure 9
Volcano plot illustration of joint FC and P gene selection rule. Genes in sectors A and C are selected as differentially expressed. The colors correspond to the negative log10 P and log2 fold change values: Red: 20 < -log10 P < 50 and 3 < log2 fold < 9 or -9 < log2 fold < -3. Blue: 10 < -log10 P < 50 and 2 < log2 fold < 3 or -3 < log2 fold < -2. Yellow: 4 < -log10 P < 50 and 1 < log2 fold < 2 or -2 < log2 fold < -1. Pink: 10 < -log10 P < 20 and 3 < log2 fold or log2 fold < -3. Light blue: 4 < -log10 P < 10 and 2 < log2 fold or log2 fold < -2. Light green: 2 < -log10 P < 4 and 1 < log2 fold or log2 fold < -1. Gray)
Figure 10
Inter-site concordance based on FC, t-test, Wilcoxon rank-sum test, and SAM. Affymetrix data on samples A and B from site 1 and site 2 for the "12,091" commonly mapped genes were used [13]. No flagged ("Absent") genes were excluded from the analysis. For the Wilcoxon rank-sum tests, there were many ties, i.e., many genes exhibited the same level of statistical significance because of the small sample sizes (five replicates for each group). Ties among genes from each test site were broken by random ordering. Concordance between genes selected completely at random is shown in red and reaches 50% when all candidate genes are declared differentially expressed; the other 50% of genes are in opposite regulation directions. SAM improves inter-site reproducibility compared to the t-test and approaches, but does not exceed, that of fold change.
Figure 11
Gene selection and percentage of agreement in gene lists in simulated data sets. Illustrations of the effect of biological context, replicate CV distribution, gene list size, and gene selection rules/methods on the reproducibility of gene lists using simulated microarray data. In some sense, these three graphs represent some extremes as well as typical scenarios in differential expression assays. However, FC sorting with low P thresholds (0.001 or 0.0001; light and medium gray boxes) consistently performed better overall than the other rules, even when FC-ranking or P-ranking by itself did not perform as well.
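The POG (percentage of overlapping genes) metric plotted in the figures above can be sketched as follows. This is an illustrative reading of the caption definitions (the percentage of genes common to two DEG lists of the same size), not the authors' code. The function name `pog`, the optional direction check, and the toy gene identifiers are assumptions; the direction check reflects the Figure 10 caption, where genes regulated in opposite directions do not count as concordant.

```python
def pog(ranked_a, ranked_b, n, direction_a=None, direction_b=None):
    """Percentage of overlapping genes between the top-n entries of two ranked lists.

    ranked_a, ranked_b       : gene identifiers, best-ranked first
    n                        : number of genes selected from each list
    direction_a, direction_b : optional dicts mapping gene -> +1 (up) or -1 (down);
                               when both are given, a shared gene counts only if
                               its direction of regulation agrees in both lists
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")

    top_a = ranked_a[:n]
    top_b = set(ranked_b[:n])

    # Genes present in both top-n lists.
    common = [gene for gene in top_a if gene in top_b]

    # Optionally require the same direction of regulation in both lists.
    if direction_a is not None and direction_b is not None:
        common = [g for g in common if direction_a[g] == direction_b[g]]

    return 100.0 * len(common) / n


# Toy ranked lists from two hypothetical test sites.
site1 = ["g1", "g2", "g3", "g4"]
site2 = ["g2", "g1", "g5", "g3"]
dir1 = {"g1": +1, "g2": -1, "g3": +1, "g4": +1}
dir2 = {"g2": -1, "g1": -1, "g5": +1, "g3": +1}

print(pog(site1, site2, 2))              # overlap ignoring direction
print(pog(site1, site2, 2, dir1, dir2))  # overlap requiring the same direction
```

Evaluating `pog` over a range of `n` values and plotting the result against `n` would give one POG line of the kind described in the captions.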

References

    1. Tan PK, Downey TJ, Spitznagel EL, Jr, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 2003;31:5676–5684. doi: 10.1093/nar/gkg763.
    2. Ramalho-Santos M, Yoon S, Matsuzaki Y, Mulligan RC, Melton DA. "Stemness": transcriptional profiling of embryonic and adult stem cells. Science. 2002;298:597–600. doi: 10.1126/science.1072530.
    3. Ivanova NB, Dimos JT, Schaniel C, Hackney JA, Moore KA, Lemischka IR. A stem cell molecular signature. Science. 2002;298:601–604. doi: 10.1126/science.1073823.
    4. Fortunel NO, Otu HH, Ng HH, Chen J, Mu X, Chevassut T, Li X, Joseph M, Bailey C, Hatzfeld JA, et al. Comment on " 'Stemness': transcriptional profiling of embryonic and adult stem cells" and "a stem cell molecular signature". Science. 2003;302:393. doi: 10.1126/science.1086384. author reply 393.
    5. Miller RM, Callahan LM, Casaceli C, Chen L, Kiser GL, Chui B, Kaysser-Kranich TM, Sendera TJ, Palaniappan C, Federoff HJ. Dysregulation of gene expression in the 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine-lesioned mouse substantia nigra. J Neurosci. 2004;24:7445–7454. doi: 10.1523/JNEUROSCI.4204-03.2004.