BMC Bioinformatics. 2012 Apr 12;13 Suppl 5(Suppl 5):S2.
doi: 10.1186/1471-2105-13-S5-S2.

PIntron: A Fast Method for Detecting the Gene Structure Due to Alternative Splicing via Maximal Pairings of a Pattern and a Text

Free PMC article

Yuri Pirola et al. BMC Bioinformatics. 2012.

Abstract

Background: A challenging issue in designing computational methods that predict a gene's exon-intron structure from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy together with efficiency in time and space when processing large clusters of more than 20,000 ESTs and genes longer than 1 Mb. Traditionally, the problem has been addressed by combining different tools that were not specifically designed for this task.

Results: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm with provably low time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow inferring splice site junctions largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented in the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides predicting the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts.
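
To make the notion of a maximal pairing concrete, the following is a minimal sketch and not the PIntron algorithm. It assumes that a maximal pairing is a triple (p, t, l) such that P[p:p+l] equals T[t:t+l] and the match cannot be extended to the left or to the right. It uses a simple quadratic-time dynamic program rather than the linear-time procedure described above; the function name `maximal_pairings` and the parameter `min_len` are hypothetical and chosen only for illustration.

```python
def maximal_pairings(P, T, min_len):
    """Return sorted (p, t, l) triples with P[p:p+l] == T[t:t+l] and l >= min_len
    that cannot be extended in either direction.

    Illustrative O(|P| * |T|) sketch; NOT the linear-time method of the paper.
    """
    pairings = []
    # prev[j + 1] = length of the longest common suffix of P[:i] and T[:j + 1]
    prev = [0] * (len(T) + 1)
    for i in range(len(P)):
        cur = [0] * (len(T) + 1)
        for j in range(len(T)):
            if P[i] != T[j]:
                continue
            cur[j + 1] = prev[j] + 1
            length = cur[j + 1]
            # The match is left-maximal by construction; check right-maximality.
            right_maximal = (
                i + 1 == len(P) or j + 1 == len(T) or P[i + 1] != T[j + 1]
            )
            if right_maximal and length >= min_len:
                pairings.append((i - length + 1, j - length + 1, length))
        prev = cur
    return sorted(pairings)


# Toy example: P shares "AACC" and "TT" with T, separated on T by "GG".
print(maximal_pairings("AACCTT", "AACCGGTT", 2))  # [(0, 0, 4), (4, 6, 2)]
```

In this toy input the two reported pairings play the role of two exon fragments of the transcript separated by an unmatched genomic stretch.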

Conclusions: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations.

Figures

Figure 1
The colored directed graph representing a gene structure. The represented gene structure, induced by compositions, is composed of 6 genomic exons: A, B, C, C', D, E. Dashed edges represent noncoding regions, bold edges represent regions included in all gene isoforms, and the remaining normal edges represent regions that are both coding and noncoding (i.e., they are included in some gene isoforms and retained as part of an intron in others). For clarity, each exon is indicated by a curve above the graph, and each intron by two connected segments below the graph. Observe that C and C' are competing exons, while exons B and D are cassette exons.
Figure 2
An embedding and its relationships with the genome and a transcript. The strings x1, ..., x9 are substrings shared by the genome and the transcript, each corresponding to a pairing. Each common substring (pairing) is longer than a fixed threshold ℓE. Intuitively, when the distance (measured on the genome) between two consecutive pairings is smaller than ℓD, we assume that those pairings belong to the same exon. When the same distance is larger than ℓI, we assume that those pairings belong to different exons.
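
The threshold rule in this caption can be sketched as follows. This is an illustration under stated assumptions, not PIntron code: a pairing is taken to be a triple (p, t, l) as in Figure 3, the "distance measured on the genome" is assumed to be the gap between the end of the first pairing and the start of the second on T, and distances between ℓD and ℓI are reported as undetermined because the caption does not say how they are handled. The function name `classify_gap` and the threshold values in the example are hypothetical.

```python
def classify_gap(first, second, l_d, l_i):
    """Classify two consecutive pairings (p, t, l) by their genomic gap.

    Assumption: the gap is measured on T from the end of `first`
    to the start of `second`.
    """
    _, t1, len1 = first
    _, t2, _ = second
    gap = t2 - (t1 + len1)
    if gap < l_d:
        return "same exon"
    if gap > l_i:
        return "different exons"
    return "undetermined"


# Hypothetical thresholds, for illustration only.
L_D, L_I = 10, 50
print(classify_gap((0, 100, 20), (20, 123, 15), L_D, L_I))  # same exon
print(classify_gap((0, 100, 20), (20, 900, 15), L_D, L_I))  # different exons
```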
Figure 3
Possible relative positions of two maximal pairings connected by an embedding graph edge. The figure presents the possible configurations of relative positions of two maximal pairings ek = (pk, tk, lk) and vk+1 = (pk+1, tk+1, lk+1) connected by an embedding graph edge (ek, vk+1). Each box represents a common maximal factor on T (top) and P (bottom) of a maximal pairing. Each maximal pairing is represented by two boxes connected by lines (boxes representing ek are in bold). For each case, tk corresponds to the left border of the upper bold box, pk is the left border of the lower bold box, tk+1 is the left border of the upper normal box, and pk+1 is the left border of the lower normal box. The distance |(tk+1 - tk) - (pk+1 - pk)| is represented by a double-ended arrow, while factor overlaps are highlighted by grey shades. Four possible cases are presented: (a) ek, vk+1 overlap on both T and P, (b) ek, vk+1 overlap on T but not on P, (c) ek, vk+1 overlap on P but not on T, and (d) ek, vk+1 overlap neither on T nor on P.
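
As a rough illustration of the four cases and of the distance |(tk+1 - tk) - (pk+1 - pk)|, the sketch below classifies a pair of pairings given as (p, t, l) triples. It assumes that the first pairing starts before the second on both T and P, so an overlap occurs exactly when the second factor starts before the first one ends. The function name `relative_position` is hypothetical and the code is not taken from PIntron.

```python
def relative_position(e, v):
    """Return (case, distance) for pairings e = (p_k, t_k, l_k) and v = (p_k1, t_k1, l_k1).

    Assumption: e starts before v on both T and P.
    case is one of "a", "b", "c", "d", following the caption above.
    """
    p_k, t_k, l_k = e
    p_k1, t_k1, _ = v
    overlap_on_t = t_k1 < t_k + l_k
    overlap_on_p = p_k1 < p_k + l_k
    if overlap_on_t and overlap_on_p:
        case = "a"
    elif overlap_on_t:
        case = "b"
    elif overlap_on_p:
        case = "c"
    else:
        case = "d"
    distance = abs((t_k1 - t_k) - (p_k1 - p_k))
    return case, distance


print(relative_position((0, 0, 5), (3, 4, 5)))    # ('a', 1)
print(relative_position((0, 0, 5), (5, 105, 4)))  # ('d', 100)
```

In the second call the two factors are adjacent on P but 100 positions apart on T, which is the kind of configuration one would expect when an intron separates them.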
Figure 4
Accuracy achieved by PIntron, Exogean and ASPic at various levels. The boxplot presents the distribution of specificity and sensitivity achieved by the three tools at the exon, intron, transcript and nucleotide levels. The vertical edges of the boxes represent the first quartile, the median and the third quartile (from left to right). The cross is the average. The vertical dashed lines represent an estimate of the 95% confidence interval of the median. The circles are all the outliers with respect to that confidence interval.
