Motivation: Carbohydrate sugar chains, or glycans, are considered the third major class of biomolecules after DNA and proteins. They are branched structures of monosaccharides that extend from a single root monosaccharide. Glycans are vital to the development and functioning of multicellular organisms because they are recognized by various proteins, and this recognition underlies specific biological functions. Our aim is to study this recognition mechanism by applying informatics techniques to the available data. Previously, we introduced a probabilistic sibling-dependent tree Markov model (PSTMM), which we showed could be trained efficiently on sibling-dependent tree structures and could return the most likely state paths. However, it had two limitations: the extra dependency between siblings caused overfitting, and retrieving patterns from the trained model required extracting them manually from the most likely state paths. We therefore introduce the profilePSTMM model, which avoids these problems and incorporates a novel concept of distinct types of state transitions to handle parent-child and sibling dependencies differently.
Results: Our new algorithms are more efficient than those of PSTMM and extract patterns more easily. We tested the profilePSTMM model on both synthetic (controlled) data and glycan data from the KEGG GLYCAN database. Additionally, we tested it on glycans known to be recognized and bound by proteins with various binding affinities, and we show that our results correlate with those published in the literature.