BMC Bioinformatics. 2009 Jul 29;10:234.
doi: 10.1186/1471-2105-10-234.

CLEAN: CLustering Enrichment ANalysis

Free PMC article

Johannes M Freudenberg et al. BMC Bioinformatics.

Abstract

Background: Integration of biological knowledge encoded in various lists of functionally related genes has become one of the most important aspects of analyzing genome-wide functional genomics data. In the context of cluster analysis, the functional coherence of clusters established through such analyses has been used to identify biologically meaningful clusters, to compare clustering algorithms, and to identify biological pathways associated with the biological process under investigation.

Results: We developed a computational framework for analytically and visually integrating knowledge-based functional categories with the cluster analysis of genomics data. The framework is based on the simple, conceptually appealing, and biologically interpretable gene-specific functional coherence score (CLEAN score). The score is derived by correlating the clustering structure as a whole with functional categories of interest. We directly demonstrate that integrating biological knowledge in this way improves the reproducibility of conclusions derived from cluster analysis. The CLEAN score differentiates between the levels of functional coherence for genes within the same cluster based on their membership in enriched functional categories. We show that this aspect results in higher reproducibility across independent datasets and produces more informative genes for distinguishing different sample types than scores based on the traditional cluster-wide analysis. We also demonstrate the utility of the CLEAN framework in comparing clusterings produced by different algorithms. CLEAN was implemented as an add-on R package and can be downloaded at http://Clusteranalysis.org. The package integrates routines for calculating gene-specific functional coherence scores and the open source interactive Java-based viewer Functional TreeView (FTreeView).

Conclusion: Our results indicate that using the gene-specific functional coherence score improves the reproducibility of the conclusions made about clusters of co-expressed genes over using the traditional cluster-wide scores. Using gene-specific coherence scores also simplifies the comparisons of clusterings produced by different clustering algorithms and provides a simple tool for selecting genes with a "functionally coherent" expression profile.

Figures

Figure 1
Calculating functional coherence scores. Given a hierarchical clustering of genes based on their expression profiles and a set of functional categories (e.g. Gene Ontologies), the CLustering Enrichment ANalysis (CLEAN) score for a gene is calculated as the maximum of -log(Fisher's Exact Test q-value) of enrichment tests across all pairs of clusters containing the gene and functional categories containing the gene (see methods for details). The Cluster-wide CLEAN score (cwCLEAN) is calculated in a similar fashion except that the maximum is taken over all clusters that contain the gene and all functional categories regardless of whether they contain the gene or not.
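The calculation described in this caption can be sketched as follows. This is a minimal illustration, not the CLEAN R package's implementation: it assumes a one-sided (upper-tail) Fisher's exact test, Benjamini-Hochberg q-values computed over all cluster-category pairs, and a base-10 logarithm, none of which the caption specifies. The function names and the dictionary-of-sets inputs are hypothetical.

```python
from math import comb, log10

def fisher_enrichment_p(cluster, category, universe_size):
    """One-sided Fisher's exact p-value (hypergeometric upper tail, an assumption)."""
    k = len(cluster & category)               # genes in both cluster and category
    n, K, N = len(cluster), len(category), universe_size
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed q-value method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from largest to smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

def clean_scores(clusters, categories, genes):
    """Return (CLEAN, cwCLEAN) dictionaries mapping gene -> score.

    clusters:   name -> set of genes, one entry per node of the clustering tree
    categories: name -> set of genes (restricted to the gene universe)
    """
    pairs = [(c, f) for c in clusters for f in categories]
    pvals = [fisher_enrichment_p(clusters[c], categories[f], len(genes))
             for c, f in pairs]
    qvals = bh_qvalues(pvals)
    clean = {g: 0.0 for g in genes}
    cwclean = {g: 0.0 for g in genes}
    for (c, f), q in zip(pairs, qvals):
        score = -log10(max(q, 1e-300))        # guard against floating-point underflow
        for g in clusters[c]:
            # cwCLEAN: every category counts, whether or not it contains the gene
            cwclean[g] = max(cwclean[g], score)
            # CLEAN: only categories that contain the gene count
            if g in categories[f]:
                clean[g] = max(clean[g], score)
    return clean, cwclean
```

In this sketch a gene that sits in an enriched cluster but belongs to none of the enriched categories keeps a CLEAN score of zero while still receiving the cluster's cwCLEAN score, which is the distinction the caption draws.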
Figure 2
Comparison of clustering methods. We compared functional coherence of six clustering algorithms: context-specific infinite mixture model (CSIMM), Euclidean distance-based and Pearson's correlation-based hierarchical clustering with and without prior variance-rescaling of the data, across four independent human breast cancer datasets (GEO expression series GSE1456 [29], GSE3494 [28], GSE7390 [30], and GSE11121 [31]). For all six algorithms, the hierarchical clustering was constructed using the average linkage principle. The number of genes common to all four datasets after filtering was 6,150. CLEAN scores are plotted on the x-axis, and the number of genes with a CLEAN score greater than each value is plotted on the y-axis. A larger area under the curve implies higher functional coherence.
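A curve of this kind can be computed from a vector of gene-level scores as sketched below. The threshold grid and the trapezoidal rule for the area are assumptions made for illustration; the caption does not state how the area was evaluated, and the function names are hypothetical.

```python
def coherence_curve(scores, thresholds):
    """For each threshold, count the genes whose score is strictly greater."""
    return [sum(1 for s in scores if s > t) for t in thresholds]

def area_under_curve(xs, ys):
    """Trapezoidal area under the points (xs[i], ys[i]); xs must be increasing."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))

# Toy scores for four genes, evaluated on an integer threshold grid.
scores = [0.5, 1.5, 2.5, 3.5]
thresholds = [0, 1, 2, 3, 4]
counts = coherence_curve(scores, thresholds)
area = area_under_curve(thresholds, counts)
```

Two clusterings of the same gene set can then be compared by computing the area for each on a shared threshold grid.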
Figure 3
Integrating cluster analysis and functional knowledge. Genes were clustered using the CSIMM [22] algorithm and variance-scaled data from two independent breast cancer datasets (GSE3494 [28] and GSE7390 [31]), and CLEAN scores were computed for both clusterings. The number of genes common to both datasets after filtering was 8,567. A) The gene-specific CLEAN scores for the two datasets were plotted against each other and the Pearson's correlation coefficient was computed. A small amount of random noise was added to the scatter plot to better separate overlapping data points. B) Pairwise similarity measures between genes computed by CSIMM were also plotted and correlated. C) Expression profiles of the genes with the highest CLEAN scores in both datasets showed strong co-expression in both datasets. All genes in this cluster are immunity related.
Figure 4
Reproducibility of CLEAN and cwCLEAN scores. The reproducibility of the functional coherence results for 6 different clustering algorithms was assessed by calculating all pairwise Pearson's correlation coefficients between scores for all algorithms applied to four independent human breast cancer datasets (GEO expression series GSE1456 [29], GSE3494 [28], GSE7390 [31], and GSE11121 [30]). Each row and column of this symmetric heatmap represents one score type for one clustering algorithm in one dataset. The symmetric hierarchical clustering of rows and columns was constructed using pairwise Pearson's correlations between different scores as the similarity measure and applying the complete linkage principle.
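The correlation step behind such a heatmap can be sketched as below. The sketch covers only the pairwise Pearson's correlations, not the complete-linkage clustering of rows and columns, and assumes that every score vector lists the same genes in the same order. The labels and function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(score_vectors):
    """Symmetric matrix of pairwise correlations.

    score_vectors: label -> list of gene scores, e.g. one label per
    (score type, algorithm, dataset) combination.
    """
    labels = list(score_vectors)
    matrix = [[pearson(score_vectors[a], score_vectors[b]) for b in labels]
              for a in labels]
    return labels, matrix
```

Each entry of the matrix is the similarity between two (score type, algorithm, dataset) combinations, so higher off-diagonal values between datasets indicate more reproducible scores.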
Figure 5
Differences in the reproducibility of CLEAN and cwCLEAN scores. Improvements in the reproducibility of CLEAN over cwCLEAN scores were demonstrated by box plots of differences in correlation coefficients, and in odds ratios and p-values from 2-by-2 contingency tables of statistically significant scores. A) Box plots of differences in correlations between CLEAN and cwCLEAN scores for all 6 pairs of breast cancer datasets and three different clustering algorithms. All differences are positive, indicating that the correlation coefficient was higher for CLEAN scores in each of the 6 pairs. B) Box plots of differences in odds ratios for 2-by-2 contingency tables of statistically significant CLEAN and cwCLEAN scores for all 6 pairs of breast cancer datasets and three different clustering algorithms. All differences are positive, implying higher reproducibility of CLEAN scores. C) Box plots of differences in statistical significance (-log10(p-value)) of Fisher's exact test for the same contingency tables as in B). All differences are again positive, implying higher reproducibility of CLEAN scores.
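For one pair of datasets, the 2-by-2 table, its odds ratio, and the Fisher's exact p-value can be sketched as follows. The one-sided alternative (more overlap than expected by chance) is an assumption, since the caption does not state the sidedness of the test. The function names are hypothetical, and both sets of significant genes are assumed to be subsets of the shared gene universe.

```python
from math import comb

def contingency_table(sig_a, sig_b, genes):
    """Counts of genes significant in both, only A, only B, or neither dataset."""
    both = len(sig_a & sig_b)
    only_a = len(sig_a - sig_b)
    only_b = len(sig_b - sig_a)
    neither = len(genes) - both - only_a - only_b
    return both, only_a, only_b, neither

def odds_ratio(both, only_a, only_b, neither):
    """Sample odds ratio; infinite when an off-diagonal cell is empty."""
    if only_a * only_b == 0:
        return float("inf")
    return (both * neither) / (only_a * only_b)

def fisher_one_sided_p(both, only_a, only_b, neither):
    """P(overlap >= observed) under the hypergeometric null (assumed one-sided)."""
    n = both + only_a                     # significant in dataset A
    K = both + only_b                     # significant in dataset B
    N = both + only_a + only_b + neither  # all genes
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(both, min(n, K) + 1))
    return tail / comb(N, n)
```

Computing these quantities once for CLEAN and once for cwCLEAN on the same dataset pair, then taking the difference, gives one point of the kind summarized in panels B and C.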
Figure 6
Unsupervised selection of informative genes. Genes were clustered based on their expression across different tissue samples, and functional coherence scores were calculated for the human and mouse datasets separately. The ability of different groups of genes to facilitate correct grouping of samples from the same tissue type in the combined human-mouse dataset was assessed by constructing ROC curves. The ROC curve for clustering samples based on all 10,287 genes is included in each plot (red line) for reference. A) ROC curves for clustering samples based on genes with statistically significant CLEAN scores in both mouse and human datasets, and on genes not statistically significant in either dataset. B) Same as A) for cwCLEAN instead of CLEAN scores. C) ROC curves based on genes selected using COPA. The number of selected genes was identical to the number of genes with statistically significant CLEAN scores used in A).
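The caption does not say how the ROC curves were constructed. One plausible construction, offered here purely as an assumption, treats every pair of samples as an item: the pair is a positive if both samples come from the same tissue type, and it is ranked by the similarity of the two samples computed from the selected genes. The sketch below builds the ROC points and the area under the curve from such pair scores; the function names are hypothetical, and tied scores are not grouped.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate).

    scores: similarity of each sample pair (higher = more similar)
    labels: True if the pair shares a tissue type (assumed definition of a positive)
    Ties in scores are processed one at a time rather than grouped.
    """
    positives = sum(1 for is_pos in labels if is_pos)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def roc_auc(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Under this construction, a gene set that groups same-tissue samples well yields a curve closer to the top-left corner and a larger area than the all-genes reference curve.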
Figure 7
Integrated software package. CLEAN was implemented as an add-on R package [36]. The package integrates routines for calculating gene-specific functional coherence scores and the interactive Java-based viewer Functional TreeView (FTreeView). The figure shows a screenshot of an FTreeView session displaying CLEAN results for one breast cancer dataset, GSE3494 [28]. FTreeView was developed from the original Java TreeView [38] by adding panel 3, which displays functional cluster annotations generated by the CLEAN R package. This functionality enables seamless integration and browsing of functional categories associated with each cluster of genes (panel 2), which in turn can be selected based on the functional coherence scores (panel 1). The selected cluster of genes (panel 2), which we identified based on its overall high CLEAN scores (panel 1), is highly enriched for genes associated with immunity-related Gene Ontology terms (FDR < 10^-60), as well as for two KEGG pathways and putative targets of the Interferon Consensus Sequence-binding protein (ICSBP) transcription factor. These results can be viewed interactively using the Java web-start version of FTreeView.
Figure 8
Expression patterns of genes with statistically significant CLEAN scores in four independent breast cancer datasets. The heatmap indicates that all genes belong to clusters with coherent expression patterns in each dataset. Functional categories on the right-hand side indicate the enriched functional categories for each global cluster of co-expressed genes. This heatmap can be browsed interactively using FTreeView.

References

    1. Slonim DK. From patterns to pathways: gene expression data analysis comes of age. Nat Genet. 2002;32:502–508. doi: 10.1038/ng1033.
    2. Do JH, Choi DK. Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol Cells. 2008;25:279–288.
    3. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
    4. MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1965. pp. 281–297.
    5. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999;22:281–285. doi: 10.1038/10343.
