CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules

Bioinformatics. 2016 Mar 1;32(5):697-704. doi: 10.1093/bioinformatics/btv635. Epub 2015 Oct 30.

Abstract

Motivation: Methods for extracting knowledge from Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis, and specifically on case-control studies that use rule-based supervised classification algorithms to build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to elicit more knowledge by computing many classification models, and thereby to identify most of the genes related to the predicted class.

Results: We propose CAMUR, a new method that extracts multiple equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA), and we also validate CAMUR and its models on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and their relation to a particular cancer are deduced.
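The iterative procedure described in the Results can be illustrated with a minimal Python sketch. It is not the CAMUR implementation: the learner (`toy_learner`, a conjunction of per-gene threshold rules), the accuracy threshold `min_accuracy`, the iteration cap `max_iterations`, the breadth-first order in which gene combinations are removed, and the four-gene data set are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def power_set(genes):
    """All non-empty subsets of `genes`, smallest subsets first."""
    ordered = sorted(genes)
    return [frozenset(c)
            for size in range(1, len(ordered) + 1)
            for c in combinations(ordered, size)]


def fit_stump(samples, labels, gene):
    """One threshold rule on one gene (assumes both classes are present)."""
    cases = [s[gene] for s, y in zip(samples, labels) if y]
    controls = [s[gene] for s, y in zip(samples, labels) if not y]
    case_mean = sum(cases) / len(cases)
    control_mean = sum(controls) / len(controls)
    # Threshold is the midpoint between the two class means.
    return gene, (case_mean + control_mean) / 2, case_mean >= control_mean


def predict(rules, sample):
    """A sample is called a case only if every rule in the conjunction fires."""
    return all((sample[g] > t) == high for g, t, high in rules)


def accuracy(rules, samples, labels):
    hits = sum(predict(rules, s) == y for s, y in zip(samples, labels))
    return hits / len(labels)


def toy_learner(samples, labels, genes, max_genes=2):
    """Hypothetical stand-in for a rule learner: conjunction of the best stumps."""
    stumps = [fit_stump(samples, labels, g) for g in sorted(genes)]
    stumps.sort(key=lambda rule: -accuracy([rule], samples, labels))
    rules = stumps[:max_genes]
    return rules, accuracy(rules, samples, labels)


def extract_models(samples, labels, genes, min_accuracy=0.9, max_iterations=50):
    """Refit after hiding gene combinations; keep every sufficiently accurate model."""
    models = []                   # (genes used, accuracy) per distinct model
    pending = [frozenset()]       # gene sets to hide before the next fit
    seen = {frozenset()}
    iterations = 0
    while pending and iterations < max_iterations:   # assumed stopping criterion
        hidden = pending.pop(0)
        iterations += 1
        remaining = set(genes) - hidden
        if not remaining:
            continue
        rules, acc = toy_learner(samples, labels, remaining)
        if acc < min_accuracy:
            continue              # this branch no longer gives a reliable model
        used = frozenset(g for g, _, _ in rules)
        if used not in {m[0] for m in models}:
            models.append((used, acc))
        # Schedule the removal of every combination of the genes just used.
        for subset in power_set(used):
            candidate = hidden | subset
            if candidate not in seen:
                seen.add(candidate)
                pending.append(candidate)
    return models


# Invented toy data: genes A, B and D separate cases from controls, C does not.
samples = [
    {"A": 9, "B": 8, "C": 5, "D": 7},
    {"A": 8, "B": 9, "C": 1, "D": 8},
    {"A": 1, "B": 2, "C": 5, "D": 1},
    {"A": 2, "B": 1, "C": 1, "D": 2},
]
labels = [True, True, False, False]   # True = case, False = control

models = extract_models(samples, labels, genes={"A", "B", "C", "D"})
gene_counts = Counter(g for used, _ in models for g in used)
```

On this toy data the loop returns three equally accurate models built on different gene pairs, and `gene_counts` tallies how often each gene appears across them, which mirrors the idea of ranking genes by their frequency in the extracted models.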

Availability and implementation: dmb.iasi.cnr.it/camur.php

Contact: emanuel@iasi.cnr.it

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Algorithms
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Neoplasms*
  • RNA
  • Sequence Analysis, RNA

Substances

  • RNA