Validation and functional annotation of expression-based clusters based on gene ontology

BMC Bioinformatics. 2006 Aug 15:7:380. doi: 10.1186/1471-2105-7-380.

Abstract

Background: The biological interpretation of large-scale gene expression data is one of the paramount challenges in current bioinformatics. In particular, placing the results in the context of other available functional genomics data, such as existing bio-ontologies, has already provided substantial improvement for detecting and categorizing genes of interest. One common approach is to look for functional annotations that are significantly enriched within a group or cluster of genes, as compared to a reference group.

Results: In this work, we suggest the information-theoretic concept of mutual information to investigate the relationship between groups of genes, as given by data-driven clustering, and their respective functional categories. Drawing upon related approaches (Gibbons and Roth, Genome Research 12:1574-1581, 2002), we seek to quantify to what extent individual attributes are sufficient to characterize a given group or cluster of genes.

Conclusion: We show that the mutual information provides a systematic framework to assess the relationship between groups or clusters of genes and their functional annotations in a quantitative way. Within this framework, the mutual information allows us to address and incorporate several important issues, such as the interdependence of functional annotations and combinatorial combinations of attributes. It thus supplements and extends the conventional search for overrepresented attributes within a group or cluster of genes. In particular taking combinations of attributes into account, the mutual information opens the way to uncover specific functional descriptions of a group of genes or clustering result. All datasets and functional annotations used in this study are publicly available. All scripts used in the analysis are provided as additional files.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms
  • Biotechnology
  • Cluster Analysis*
  • Data Interpretation, Statistical
  • Databases, Genetic
  • Gene Expression Profiling*
  • Models, Genetic
  • Pattern Recognition, Automated / methods
  • Proteins / classification
  • Proteins / genetics*
  • Reproducibility of Results

Substances

  • Proteins