BMC Bioinformatics. 2008 Jun 18;9:288.
doi: 10.1186/1471-2105-9-288.

Genome-scale Cluster Analysis of Replicated Microarrays Using Shrinkage Correlation Coefficient

Free PMC article

Jianchao Yao et al. BMC Bioinformatics. 2008.

Abstract

Background: Clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two correlations most widely used as similarity metrics for clustering microarray data. However, these two correlations are not optimal for analyzing the replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data.

Results: In this study, we describe a novel correlation coefficient, the shrinkage correlation coefficient (SCC), that fully exploits the similarity between replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group when clustering expression data, and provides a robust statistical estimate of the error of replicated microarray data. The value of SCC is shown by comparing it with the two correlation coefficients that are currently most widely used (the Pearson correlation coefficient and the SD-weighted correlation coefficient), using statistical measures on both synthetic expression data and real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering, were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns.
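As a point of reference for the baseline named above, the following is a minimal sketch of the Pearson correlation computed on replicate-averaged expression profiles. It is not the SCC, whose formula is not given in this abstract. Averaging replicates before correlating, the function names `replicate_means` and `pearson`, and the example values are all illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def replicate_means(replicates):
    """Collapse each condition's replicate measurements to their mean.

    `replicates` is a list of conditions, each a list of replicate values.
    (Assumed layout, chosen for illustration only.)
    """
    return [mean(group) for group in replicates]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Hypothetical log2 ratios: three conditions, two replicates each.
gene_a = [[0.1, 0.3], [1.0, 1.2], [2.1, 1.9]]
gene_b = [[0.2, 0.0], [0.9, 1.1], [2.0, 2.2]]

r = pearson(replicate_means(gene_a), replicate_means(gene_b))
distance = 1.0 - r  # a common correlation-based dissimilarity for clustering
```

Because the replicates are averaged first, this baseline discards the within-group variance and the number of replicates, which is the information the abstract says SCC takes into account.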

Conclusion: This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.

Figures

Figure 1
The performance of the three correlations indicated by the adjusted Rand index obtained from the synthetic data sets using hierarchical clustering and k-means clustering. The number of replicates varies from 2 to 20. Each correlation is represented by a curve: SCC (red), SD-weighted correlation (green), and Pearson correlation (blue). Every data point on a curve is an average adjusted Rand index over 1000 trials of generating and clustering the synthetic data. Hierarchical clustering: (a) Low noise level. (b) High noise level. K-means clustering: (c) Low noise level. (d) High noise level. Error bars are not shown because, at the scale of the figure, they are too small to be depicted graphically after 1000 trials.
Figure 2
The performance of the three correlations indicated by the adjusted Rand index obtained from the real yeast expression data using hierarchical clustering and k-means clustering. Each correlation is represented by a bar: SCC (red), SD-weighted correlation (green), and Pearson correlation (blue). The y-axis is the adjusted Rand index.
Figure 3
Gene-wise bias (Eigengene 5) associated with the two prints of arrays. The abundance of Eigengene 5 is shown for each of the 24 arrays, with the arrays in the order obtained in Additional file 3. The 24 dots denote all of the arrays: Cri2 arrays (red), Cri3 arrays (black). Array names are similarly color coded.
Figure 4
Histogram of the optimal shrinkage factor λi. The mean, standard deviation, and total number of λi are shown in the upper left corner of the histogram.
Figure 5
TUG expression profile in the early stages of gametophyte development of C. richardii by SCC. (a) Unsupervised two-dimensional hierarchical clustering. Data are presented in a matrix format: each row represents an individual TUG, and each column corresponds to an experimental sample. Each expression measurement represents the normalized log2 ratio of fluorescence from the hybridized experimental sample to a reference sample. Normalized TUG expression ratios are depicted by a pseudocolor scale, with red indicating expression above the reference, black indicating expression equal to the reference, and green indicating expression below the reference. The horizontal colored boxes delimit four pairwise time point comparison groups: 0:24 hr (violet box), 6:24 hr (orange box), 12:24 hr (green box), and 48:24 hr (red box). The scale to the left of the dendrograms depicts the Pearson correlation coefficient represented by the length of the dendrogram branches connecting pairs of nodes. (b) The fold change scale extends from fluorescence ratios of -1 to 1 in log2 units. (c) Average expression profiles of Cluster A, computed by averaging the log2(Cy5/Cy3) ratios. (d) Average expression profiles of Cluster B, computed by averaging the log2(Cy5/Cy3) ratios.
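Figures 1 and 2 score clustering quality with the adjusted Rand index. The sketch below shows one standard way to compute that index from two label assignments using the pair-counting formula; it is not taken from the article, and the function name `adjusted_rand_index` and the example labels are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items.

    Returns 1.0 for identical partitions and values near 0.0 for
    agreement no better than chance; negative values are possible.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts n_ij and their row and column marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical example: known gene classes versus a clustering result.
known = [0, 0, 1, 1]
same_up_to_renaming = adjusted_rand_index(known, [1, 1, 0, 0])
crossed = adjusted_rand_index(known, [0, 1, 0, 1])
```

The index depends only on which items share a cluster, so renaming cluster labels does not change the score.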


