A data mining paradigm for identifying key factors in biological processes using gene expression data

Sci Rep. 2018 Jun 13;8(1):9083. doi: 10.1038/s41598-018-27258-8.


A large volume of biological data is being generated for studying mechanisms of various biological processes. These precious data enable large-scale computational analyses to gain biological insights. However, it remains a challenge to mine the data efficiently for knowledge discovery. The heterogeneity of these data makes it difficult to consistently integrate them, slowing down the process of biological discovery. We introduce a data processing paradigm to identify key factors in biological processes via systematic collection of gene expression datasets, primary analysis of data, and evaluation of consistent signals. To demonstrate its effectiveness, our paradigm was applied to epidermal development and identified many genes that play a potential role in this process. Besides the known epidermal development genes, a substantial proportion of the identified genes are still not supported by gain- or loss-of-function studies, yielding many novel genes for future studies. Among them, we selected a top gene for loss-of-function experimental validation and confirmed its function in epidermal differentiation, proving the ability of this paradigm to identify new factors in biological processes. In addition, this paradigm revealed many key genes in cold-induced thermogenesis using data from cold-challenged tissues, demonstrating its generalizability. This paradigm can lead to fruitful results for studying molecular mechanisms in an era of explosive accumulation of publicly available biological data.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • Biological Phenomena / genetics*
  • Cluster Analysis
  • Cold Temperature
  • Data Mining / methods*
  • Gene Expression Profiling / methods*
  • Gene Ontology
  • Humans
  • Mice
  • Skin / growth & development
  • Skin / metabolism*
  • Thermogenesis / genetics*