BMC Med Genomics. 2011 Oct 14;4:73.
doi: 10.1186/1755-8794-4-73.

Quantifying Stability in Gene List Ranking Across Microarray Derived Clinical Biomarkers

Free PMC article

Sebastian Schneckener et al. BMC Med Genomics.

Abstract

Background: Identifying stable gene lists for diagnosis, prognosis prediction, and treatment guidance of tumors remains a major challenge in cancer research. Microarrays measuring differential gene expression are widely used and should be versatile predictors of disease and other phenotypic data. However, gene expression profiling studies and the predictive biomarkers derived from them often have low statistical power, requiring numerous samples for sound statistics, or vary between studies. Given the inconsistency of results across similar studies, methods that identify robust biomarkers from microarray data are needed to relay true biological information. Here we present a method to demonstrate that gene list stability and predictive power depend not only on the size of studies, but also on the clinical phenotype.

Results: Our method projects genomic tumor expression data to a lower-dimensional space representing the main variation in the data. Some information regarding the phenotype resides in this low-dimensional space, while some information resides in the residual. We then introduce an information ratio (IR) as a metric defined by the partition between projected and residual space. Upon grouping phenotypes such as tumor tissue, histological grades, relapse, or aging, we show that higher IR values correlated with phenotypes that yield less robust biomarkers, whereas lower IR values showed higher transferability across studies. Our results indicate that the IR is correlated with predictive accuracy. When tested across different published datasets, the IR can identify information-rich data characterizing clinical phenotypes and stable biomarkers.

Conclusions: The IR presents a quantitative metric to estimate the information content of gene expression data with respect to particular phenotypes.
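The split into projected and residual data described in the Results can be illustrated with a minimal sketch. It assumes a principal-component projection computed by SVD on gene-centered data; the abstract does not state which projection the authors use, and the function name, the number of components, and the random input are purely illustrative. The IR formula itself is not given in the abstract, so it is not reproduced here.

```python
import numpy as np

def split_projected_residual(expr, n_components):
    """Split a genes x samples matrix into a low-rank projection and its residual.

    Assumption: the "main variation" is captured by the leading left singular
    vectors of the gene-centered data (a PCA-style projection).
    """
    # Center each gene (row) across samples.
    centered = expr - expr.mean(axis=1, keepdims=True)
    # Thin SVD; columns of u are orthonormal directions in gene space.
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    basis = u[:, :n_components]
    # Orthogonal projection onto the leading components, and what is left over.
    projected = basis @ (basis.T @ centered)
    residual = centered - projected
    return projected, residual

# Illustrative input only: 200 hypothetical genes, 30 hypothetical samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 30))
projected, residual = split_projected_residual(expr, n_components=5)
```

Differential-expression p-values could then be computed separately on `projected` and `residual` and compared with those from the original data, as in the Figure 1 comparison.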

Figures

Figure 1
Information partition between residual and projected space. The data comparisons demonstrate the partitioning of information between projected data Sn and residual data Sr in comparison to the original data. The x-axis shows p-values of differential gene expression in the original data, while the y-axis shows p-values for projected (blue) and residual (red) data. Qualitatively different types of information partitioning are demonstrated: (a) Type 1: control tissues are compared with lung cancer samples, (b) Type 2: non-smoker (no stress response) lung tissue is compared with smoker (stress response) samples, (c) Type 3: metastatic breast cancer tissue is compared with non-metastatic samples.
Figure 2
Mean information ratios for differential phenotypes across the studies. Low IR values are obtained for, e.g., tumor vs. control lung tissue or breast carcinoma grade 1 or 2 vs. grade 3. Higher IR values are seen for, e.g., relapse vs. relapse-free.
Figure 3
P-values of differential gene expression compared between two studies. Depending on the particular factor, p-values of differential gene expression may be dissimilar between studies [(a) grade 1 or 2 and (b) relapse], or similar [(c), grade 1&2 versus grade 3]. Genes that show similar differential expression in both studies are close to the diagonal.
Figure 4
Relationship between gene list overlap and IR. For multiple breast cancer studies, IR values of grade, size, age, ER status, and relapse are compared to the gene list overlap. Each data point represents a pair of studies with the mean IR (x-axis) and the percentage of overlapping genes (POG) of the top 5% of p-values (y-axis).
Figure 5
Gene list stability. Sample size can determine the stability of ranked gene lists in most cases. The y-axis is the percentage of overlapping genes (POG) in the top 5% list between two compared studies, and the x-axis displays the logarithmic sample size. (a) The black stars, which indicate IR values ≤ 0.25 and correlate with Type 1 phenotypic classifications, show linear and thus stable behavior, whereas the red stars, which indicate IR values > 0.25 and correlate with Types 2 and 3 phenotypic classifications, show a less uniform distribution and are thus unstable (overall r² = 0.15). (b) Gene list stability and the logarithm of the IR show a linear relation (r² = 0.76).
Figure 6
Information ratio versus intra-study prediction accuracy. The x-axis shows the information ratio of different studies/factors. The y-axis indicates the out-of-bag prediction accuracy. The vertical dashed line delineates low and high IRs; the solid trend line indicates the decrease of accuracy with increasing IR.
Figure 7
Information ratio versus inter study prediction accuracy. The use of biomarkers across studies decreases the prediction accuracy. The extent of accuracy loss (y-axis) depends on the IR (x-axis), as indicated by a steep descent of the solid trend line. A dashed vertical line delineates high and low IRs. Each dot represents the mean loss of accuracy for all studies when compared to the biomarker source study accuracy.
Figure 8
Exponential distribution of p-values of differential expression. The y-axis is the logarithm of the ratio of genes in the same bin of p-values with respect to all genes.
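The percentage of overlapping genes (POG) used in Figures 4 and 5 can be sketched as follows. This is a minimal illustration, assuming both studies report p-values for the same genes in the same order and that the overlap is normalized by the size of the top list; the function name and the normalization are assumptions, not taken from the paper.

```python
import numpy as np

def percentage_overlapping_genes(pvals_a, pvals_b, top_fraction=0.05):
    """Percentage of genes shared by the top-ranked lists of two studies.

    Assumption: index i refers to the same gene in both p-value arrays, and
    the overlap is divided by the top-list size.
    """
    pvals_a = np.asarray(pvals_a, dtype=float)
    pvals_b = np.asarray(pvals_b, dtype=float)
    if pvals_a.shape != pvals_b.shape:
        raise ValueError("both studies must cover the same genes")
    # Size of the top list, e.g. 5% of all genes (at least one gene).
    n_top = max(1, int(round(top_fraction * pvals_a.size)))
    # Indices of the genes with the smallest p-values in each study.
    top_a = set(np.argsort(pvals_a, kind="stable")[:n_top].tolist())
    top_b = set(np.argsort(pvals_b, kind="stable")[:n_top].tolist())
    return 100.0 * len(top_a & top_b) / n_top

# Illustrative p-values for 100 hypothetical genes.
study_a = np.linspace(0.001, 1.0, 100)
same_ranking = percentage_overlapping_genes(study_a, study_a)
reversed_ranking = percentage_overlapping_genes(study_a, study_a[::-1])
```

Identical rankings share the whole top list, while a fully reversed ranking shares none of it, which brackets the range of POG values plotted on the y-axis.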
