Long intergenic non-coding RNAs (lincRNAs) are a new class of non-coding RNAs that are closely associated with the occurrence and development of diseases. In previous studies, most lincRNAs were identified through next-generation sequencing. Because lincRNAs exhibit tissue-specific expression, lincRNA discovery reproduces poorly across studies. In this study, we set lincRNA expression aside and used sequence, structural, and protein-coding potential features as candidate features to construct a classifier that distinguishes lincRNAs from non-lincRNAs. The GA-SVM algorithm, a wrapper feature selection method that combines a genetic algorithm (GA) with a support vector machine (SVM), was applied to extract an optimized feature subset. In five-fold cross-validation, this optimized subset performed best among the several feature subsets compared for identifying human lincRNAs. We then constructed the LincRNA Classifier based on Selected Features (linc-SF), an SVM trained on the optimized feature subset. The classifier was further evaluated by predicting lincRNAs from two independent lincRNA sets. The recognition rates for the two sets were 100% and 99.8%, indicating that linc-SF is effective for predicting human lincRNAs.
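The wrapper idea behind GA-SVM can be sketched as a genetic algorithm that searches over binary feature masks and scores each mask with a fitness function. The sketch below is illustrative only and is not the authors' implementation: the function names (`ga_select`, `toy_fitness`), the GA settings (population size, mutation rate, one-point crossover, elitism), and the toy fitness are all assumptions. In a GA-SVM setting the fitness would instead be a cross-validated SVM score computed on the features kept by the mask.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def ga_select(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 30,
    generations: int = 40,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Search binary feature masks with a simple elitist genetic algorithm.

    All parameter defaults are assumptions; requires n_features >= 2.
    """
    rng = random.Random(seed)
    # Seed the population with the full feature set plus random masks.
    population: List[Mask] = [tuple([1] * n_features)]
    while len(population) < pop_size:
        population.append(tuple(rng.randint(0, 1) for _ in range(n_features)))

    def tournament(scored):
        # Pick two distinct individuals at random and keep the fitter one.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] >= b[0] else b[1]

    for _ in range(generations):
        scored = sorted(((fitness(m), m) for m in population), reverse=True)
        next_gen = [scored[0][1]]  # elitism: the best mask survives unchanged
        while len(next_gen) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(
                bit ^ 1 if rng.random() < mutation_rate else bit for bit in child
            )
            next_gen.append(child)
        population = next_gen

    best = max(population, key=fitness)
    return best, fitness(best)


# Hypothetical stand-in for a cross-validated SVM score: reward three
# "informative" feature indices and lightly penalize every other kept feature.
INFORMATIVE = {0, 2, 5}


def toy_fitness(mask: Mask) -> float:
    hits = sum(mask[i] for i in INFORMATIVE)
    noise = sum(mask) - hits
    return hits - 0.1 * noise


best_mask, best_score = ga_select(10, toy_fitness)
print(best_mask, best_score)
```

Because the best mask of each generation is carried over unchanged and the initial population contains the full feature set, the returned score can never be worse than the score obtained with all features kept. Swapping `toy_fitness` for a function that trains and cross-validates an SVM on the masked features would turn this into a wrapper of the kind the abstract describes.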
Keywords: Classification; Feature selection; LincRNAs.
Abbreviations: ACC, accuracy; Fm, F-measure; GA, genetic algorithm; GA–SVM, a wrapper feature selection algorithm that combines support vector machine and genetic algorithm; linc-SF, the LincRNA Classifier based on Selected Features; lincRNAs, long intergenic non-coding RNAs; lncRNAs, long non-coding RNAs; MCC, correlation coefficient; MFE, minimum free energy; pre-miRNAs, microRNA precursors; SE, sensitivity; SP, specificity; SVM, support vector machine.
© 2013 Elsevier B.V. All rights reserved.