Motivation: Genetic regulation of cellular processes is frequently investigated using large-scale gene expression experiments that observe changes in expression over time. Such temporal data poses a challenge to classical distance-based clustering methods because of its horizontal dependencies along the time axis. We propose to use hidden Markov models (HMMs) to model these time dependencies explicitly. The HMMs are used in a mixture approach that we show to be superior to clustering. Furthermore, mixtures are a more realistic model of the biological reality, as an unambiguous partitioning of genes into clusters of unique functional assignment is impossible. Use of the mixture increases robustness with respect to noise and allows the inference of groups at varying levels of assignment ambiguity. A simple approach, partially supervised learning, makes it possible to benefit from prior biological knowledge during training. Our method allows simultaneous analysis of cyclic and non-cyclic genes and copes well with noise and missing values.
Results: We demonstrate biological relevance by detecting phase-specific groupings in HeLa time-course data. A benchmark on simulated data, generated under assumptions independent of those in our method, shows very favorable results compared with the baseline supplied by k-means and with two prior approaches implementing model-based clustering. The results underline the benefits of incorporating prior knowledge whenever it is available.
Availability: A software package implementing our method is freely available under the GNU General Public License (GPL) at http://ghmm.org/gql