RNA Coding Potential Prediction Using Alignment-Free Logistic Regression Model

Methods Mol Biol. 2021;2254:27-39. doi: 10.1007/978-1-0716-1158-6_3.

Abstract

CPAT (Coding-Potential Assessment Tool) is a logistic regression model-based classifier that can accurately and quickly distinguish protein-coding from noncoding RNAs using pure linguistic features calculated from the RNA sequences. CPAT takes as input the nucleotide sequences or genomic coordinates of RNAs and outputs, for each RNA, a probability p (0 ≤ p ≤ 1) that measures how likely it is to be protein coding. Users can run CPAT online (http://lilab.research.bcm.edu/cpat/) or on a local computer after installation. CPAT provides prebuilt logistic models to recognize RNAs originating from the human (Homo sapiens), mouse (Mus musculus), zebrafish (Danio rerio), and fly (Drosophila melanogaster) genomes. Instructions on how to train models for other genomes are described on the CPAT website (http://rna-cpat.sourceforge.net/) and in this chapter.
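
To illustrate how an alignment-free logistic model can turn sequence-derived features into a probability p between 0 and 1, the following is a minimal Python sketch. It is not CPAT's code or its trained model: the two features (longest ORF length and the fraction of the transcript it covers), the function names, and the coefficient values are all assumptions chosen only for illustration.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG-to-stop ORF in the three
    forward reading frames (stop codon included). Illustrative feature only."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def coding_probability(features, coefficients, intercept):
    """Logistic function applied to a linear combination of the features."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    # Two algebraically equal forms, chosen to avoid overflow in math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


if __name__ == "__main__":
    rna = "CCATGAAATTTGGGTAACC"  # toy sequence, not real data
    orf_len = longest_orf_length(rna)
    orf_coverage = orf_len / len(rna)
    # Hypothetical coefficients: NOT the values of any prebuilt CPAT model.
    p = coding_probability([orf_len, orf_coverage], [0.01, 2.0], -3.0)
    print(f"ORF length={orf_len}, coverage={orf_coverage:.2f}, p={p:.3f}")
```

In a real model the coefficients come from training on known coding and noncoding transcripts of the genome of interest, and p is compared against a cutoff to label each RNA as coding or noncoding.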

Keywords: LincRNA; LncRNA; Logistic regression; Noncoding RNA; Prediction; Protein coding.

MeSH terms

  • Computational Biology / methods*
  • Genome
  • Internet
  • Logistic Models
  • Open Reading Frames / genetics*
  • Probability
  • RNA / genetics*
  • Sequence Alignment*
  • Software

Substances

  • RNA