TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts

Dana Wyman; Ali Mortazavi

doi:10.1093/bioinformatics/bty483

Bioinformatics. 2019 Jan 15;35(2):340-342. doi: 10.1093/bioinformatics/bty483.

Authors

Dana Wyman¹,², Ali Mortazavi¹,²

Affiliations

¹ Department of Developmental and Cell Biology, UC Irvine, Irvine, CA, USA.
² Center for Complex Biological Systems, UC Irvine, Irvine, CA, USA.

Abstract

Motivation: Long-read, single-molecule sequencing platforms hold great potential for isoform discovery and characterization of multi-exon transcripts. However, their high error rates are an obstacle to distinguishing novel transcript isoforms from sequencing artifacts. Therefore, we developed the package TranscriptClean to correct mismatches, microindels and noncanonical splice junctions in mapped transcripts using the reference genome while preserving known variants.
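The idea of variant-aware mismatch correction can be illustrated with a minimal sketch. This is not TranscriptClean's actual implementation: the function name `correct_mismatches`, its parameters, and the representation of known variants as a set of `(position, base)` pairs are all hypothetical, and the sketch assumes an ungapped alignment, so it ignores indels and splice junctions.

```python
def correct_mismatches(read_seq, ref_seq, ref_start, known_variants):
    """Revert read bases that disagree with the reference, unless the
    disagreement matches a known variant.

    Hypothetical helper for illustration only. Assumes read_seq and
    ref_seq are the same length and aligned without gaps, and that
    known_variants is a set of (reference_position, alternate_base).
    """
    corrected = []
    for offset, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq)):
        pos = ref_start + offset
        is_mismatch = read_base != ref_base
        is_known_variant = (pos, read_base) in known_variants
        if is_mismatch and not is_known_variant:
            # Likely sequencing error: fall back to the reference base.
            corrected.append(ref_base)
        else:
            # Match, or a mismatch supported by a known variant: keep it.
            corrected.append(read_base)
    return "".join(corrected)


# Toy example: two mismatches, one of which is a known variant.
ref = "ACGTACGT"
read = "ACCTACTT"
variants = {(106, "T")}  # made-up variant at reference position 106
result = correct_mismatches(read, ref, 100, variants)
```

In this toy input, the mismatch at position 102 is reverted to the reference base, while the one at position 106 is preserved because it matches the supplied variant.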

Results: Our method corrects nearly all mismatches and indels present in a publicly available human PacBio Iso-seq dataset, and rescues 39% of noncanonical splice junctions.

Availability and implementation: All Python and R scripts used in this paper are available at https://github.com/dewyman/TranscriptClean.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Computational Biology
Exons
Genome*
Humans
INDEL Mutation*
Protein Isoforms
Software*

Substances

Protein Isoforms
