Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from RNA-seq data

Katharina E Hayer; Angel Pizarro; Nicholas F Lahens; John B Hogenesch; Gregory R Grant

doi:10.1093/bioinformatics/btv488

Bioinformatics. 2015 Dec 15;31(24):3938-45. doi: 10.1093/bioinformatics/btv488. Epub 2015 Sep 3.

Authors

Katharina E Hayer¹, Angel Pizarro², Nicholas F Lahens³, John B Hogenesch³, Gregory R Grant⁴

Affiliations

¹ University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104.
² Scientific Computing at Amazon Web Services, Seattle, WA 98108.
³ Department of Pharmacology and.
⁴ University of Pennsylvania, Institute for Translational Medicine and Therapeutics, Philadelphia, PA 19104, Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.

Abstract

Motivation: Because of its advantages over microarrays, RNA sequencing (RNA-Seq) is gaining widespread popularity for highly parallel gene expression analysis. For example, RNA-Seq is expected to be able to provide accurate identification and quantification of full-length splice forms. A number of informatics packages have been developed for this purpose, but short reads make it a difficult problem in principle. Sequencing error and polymorphisms add further complications. It has become necessary to perform studies to determine which algorithms perform best and which, if any, perform adequately. However, there is a dearth of independent and unbiased benchmarking studies. Here we take an approach using both simulated and experimental benchmark data to evaluate the accuracy of these algorithms.

Results: We conclude that most methods are inaccurate even using idealized data, and that no method is highly accurate once multiple splice forms, polymorphisms, intron signal, sequencing errors, alignment errors, annotation errors and other complicating factors are present. These results point to the pressing need for further algorithm development.

Availability and implementation: Simulated datasets and other supporting information can be found at http://bioinf.itmat.upenn.edu/BEERS/bp2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Evaluation Study
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Alternative Splicing*
Animals
Benchmarking
Gene Expression Profiling / methods*
Humans
Mice
RNA Isoforms / analysis*
RNA, Messenger / analysis
Sequence Analysis, RNA / methods*

Substances

RNA Isoforms
RNA, Messenger
