Motivation: Predicting the subcellular location of proteins is an active research area, as a protein's location within the cell provides meaningful cues about its function. Several previous experiments in utilizing text for protein subcellular location prediction varied in methods, applicability and performance level. In an earlier work we have used a preliminary text classification system and focused on the integration of text features into a sequence-based classifier to improve location prediction performance.
Results: Here the focus shifts to the text-based component itself. We introduce EpiLoc, a comprehensive text-based localization system. We provide an in-depth study of text-feature selection, and study several new ways to associate text with proteins, so that text-based location prediction can be performed for practically any protein. We show that EpiLoc's performance is comparable to (and may even exceed) that of state-of-the-art sequence-based systems. EpiLoc is available at: http://epiloc.cs.queensu.ca.