BMC Bioinformatics. 2005 Jan 24:6:16.
doi: 10.1186/1471-2105-6-16.

Efficient decoding algorithms for generalized hidden Markov model gene finders

William H Majoros et al. BMC Bioinformatics. 2005.

Abstract

Background: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.

Results: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.

Conclusions: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be more enabled to explore promising extensions to the GHMM framework, thereby improving the state-of-the-art in gene prediction techniques.

Figures

Figure 1
An example GHMM topology. Diamonds represent signal states (for fixed-length features) and circles represent content states (for variable-length features). Allowable transitions are shown with arrows. ATG = start codon, TAG = stop codon, GT = donor splice site, AG = acceptor splice site, N = intergenic region, I = intron, Einit = initial exon, Eint = internal exon, Efin = final exon, Esng = single exon gene. The denoted machine operates by transitioning stochastically from state to state, emitting a gene feature of a particular type upon entering a given state.
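A topology of this kind can be written down as a plain transition table. The sketch below is illustrative only: the state names come from the caption, but the specific transitions are an assumption based on ordinary forward-strand gene structure and were not read off the figure. The uniform random choice between successors is also a placeholder for real transition probabilities, and `is_valid_path` and `random_walk` are hypothetical helper names.

```python
import random

# Signal states emit fixed-length features; content states emit
# variable-length features (names taken from the Figure 1 caption).
SIGNAL_STATES = {"ATG", "TAG", "GT", "AG"}
CONTENT_STATES = {"N", "I", "Einit", "Eint", "Efin", "Esng"}

# ASSUMPTION: these transitions follow ordinary forward-strand gene
# structure; they are not copied from the figure itself.
TRANSITIONS = {
    "N": ["ATG"],
    "ATG": ["Einit", "Esng"],
    "Einit": ["GT"],
    "GT": ["I"],
    "I": ["AG"],
    "AG": ["Eint", "Efin"],
    "Eint": ["GT"],
    "Efin": ["TAG"],
    "Esng": ["TAG"],
    "TAG": ["N"],
}


def is_valid_path(path):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(path, path[1:]))


def random_walk(steps, seed=0):
    """Walk the topology from the intergenic state, choosing successors uniformly."""
    rng = random.Random(seed)
    path = ["N"]
    for _ in range(steps):
        path.append(rng.choice(TRANSITIONS[path[-1]]))
    return path
```

Under these assumed transitions every walk alternates between content and signal states, which is why each variable-length feature ends up delimited by a pair of signals.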
Figure 2
Non-overlapping of content and signal sensors. Fixed-length features such as start codons and donor sites are detected by signal sensors, which are used to score an entire context window surrounding the signal. To avoid double-counting, content sensors score only the nucleotides strictly between two signal sensors. In this example, the CTA at the end of the start codon sensor window and the CGA at the beginning of the donor site sensor window are not scored by the exon content sensor, even though they are part of the putative exon, since those bases are already scored by the signal sensors.
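The non-overlap rule can be illustrated with half-open intervals. In this minimal sketch the `SignalWindow` type, the `content_interval` helper, the toy sequence, and the window sizes are all hypothetical; only the CTA and CGA flanks mirror the caption's example.

```python
from typing import NamedTuple, Tuple


class SignalWindow(NamedTuple):
    begin: int  # first position scored by the signal sensor
    end: int    # one past the last position scored by the signal sensor


def content_interval(left: SignalWindow, right: SignalWindow) -> Tuple[int, int]:
    """Half-open interval scored by the content sensor between two signals."""
    if right.begin < left.end:
        raise ValueError("signal sensor windows overlap")
    return left.end, right.begin


# Toy sequence: start codon window "ATGCTA", exon interior "TTTGGG",
# donor site window "CGAGT" (window sizes are made up for illustration).
seq = "ATGCTATTTGGGCGAGT"
start_window = SignalWindow(0, 6)
donor_window = SignalWindow(12, 17)
lo, hi = content_interval(start_window, donor_window)
content_bases = seq[lo:hi]  # bases scored by the exon content sensor
```

Here the CTA and CGA bases belong to the putative exon but fall inside the signal windows, so they are excluded from `content_bases`.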
Figure 3
The init_nonphased() algorithm. Initialization of a noncoding array α, given a sequence S = x_0..x_{L-1} and an nth-order Markov chain M. All parameters are assumed to be passed by reference. The procedure initializes each array element to the log probability of the nucleotide at the corresponding position in the sequence, conditional on some number of preceding bases.
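A minimal sketch of this initialization follows, assuming the Markov chain is available as a function `cond_prob(context, base)`. That interface, the toy model, and the choice to shorten the context near the start of the sequence are assumptions made here, not details from the paper. The sketch also returns a new list rather than filling an array passed by reference.

```python
import math
from typing import Callable, List


def init_nonphased(
    seq: str, order: int, cond_prob: Callable[[str, str], float]
) -> List[float]:
    """Return alpha, where alpha[i] = log P(seq[i] | up to `order` preceding bases)."""
    alpha = []
    for i, base in enumerate(seq):
        # Fewer than `order` bases are available near the start of the sequence.
        context = seq[max(0, i - order):i]
        alpha.append(math.log(cond_prob(context, base)))
    return alpha


def toy_chain(context: str, base: str) -> float:
    """Hypothetical model: a base is twice as likely to repeat its predecessor."""
    if not context:
        return 0.25
    return 0.4 if base == context[-1] else 0.2


alpha = init_nonphased("AAC", order=2, cond_prob=toy_chain)
```

For the toy input, the three entries are the logs of 0.25 (no context), 0.4 (A repeats A), and 0.2 (C follows A).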
Figure 4
The init_phased() algorithm. Initialization of a single exon array σ, given a sequence S = x_0..x_{L-1}, a set of three Markov chains P{0,1,2}, and initial phase (i.e., phase of the first array element) ω. All parameters are assumed to be passed by reference. This procedure is similar to init_nonphased(), except that the conditional probabilities are computed in a phase-specific manner by the appropriate member of the three-periodic Markov chain.
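The phase-specific variant can be sketched the same way. The assumptions here are that each of the three chains is a function `chain(context, base)`, that the phase advances by one (mod 3) per base on the forward strand, and that the toy chains are purely illustrative; none of these details are confirmed by the caption.

```python
import math
from typing import Callable, List, Sequence

Chain = Callable[[str, str], float]


def init_phased(
    seq: str, order: int, chains: Sequence[Chain], omega: int
) -> List[float]:
    """Return sigma, where sigma[i] is scored by the chain for phase (omega + i) % 3."""
    sigma = []
    for i, base in enumerate(seq):
        phase = (omega + i) % 3  # ASSUMPTION: forward strand, phase increases
        context = seq[max(0, i - order):i]
        sigma.append(math.log(chains[phase](context, base)))
    return sigma


def make_toy_chain(favoured: str) -> Chain:
    """Hypothetical chain that ignores context and favours a single base."""
    return lambda context, base: 0.7 if base == favoured else 0.1


toy_chains = [make_toy_chain("A"), make_toy_chain("C"), make_toy_chain("G")]
sigma_phase0 = init_phased("ACGT", order=2, chains=toy_chains, omega=0)
sigma_phase1 = init_phased("ACGT", order=2, chains=toy_chains, omega=1)
```

Changing ω shifts which chain scores each position, so the same sequence receives different scores in each phase.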
Figure 5
The eclipse() algorithm. Eclipsing signals in queue G when a stop codon has been encountered at position p. All parameters are assumed to be passed by reference. pos(s) is the position of the first base of the signal's consensus sequence (e.g., the A in ATG). len(s) is the length of the signal's consensus sequence (e.g., 3 for ATG). The procedure operates by computing the phase ω in which each signal is eclipsed by the stop codon, and then identifies those signals which are now eclipsed in all three phases. Any signal eclipsed in all three phases is then dropped from the queue, since any exon starting at that signal and extending up to the current position in the sequence would have an in-frame stop codon.
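The eclipsing step might look roughly like the sketch below. The phase convention is an assumption made for illustration: ω is taken as the codon position of the first base after the signal's consensus, so a stop codon starting at p is in frame when ω = (pos(s) + len(s) - p) mod 3. The paper's exact formula may differ. The `Signal` class and its `eclipsed` field are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Signal:
    pos: int     # position of the first base of the consensus (e.g. the A in ATG)
    length: int  # length of the consensus (e.g. 3 for ATG)
    eclipsed: Set[int] = field(default_factory=set)  # phases already blocked


def eclipse(queue: List[Signal], p: int) -> None:
    """Record the phase blocked by a stop codon at p; drop fully eclipsed signals."""
    survivors = []
    for s in queue:
        start = s.pos + s.length  # first base after the consensus
        if p >= start:  # ASSUMPTION: only stops beyond the signal count
            s.eclipsed.add((start - p) % 3)
        if len(s.eclipsed) < 3:
            survivors.append(s)
    queue[:] = survivors  # modify the caller's queue in place
```

A signal leaves the queue only after stop codons have been seen in all three frames, because until then at least one open reading frame still extends from it to the current position.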
Figure 6
The traceback() algorithm. Reconstruction of the optimal parse by tracing back through trellis links. Parameters are the selected right-terminus signal s and its chosen phase ω. Returns a stack of signals constituting the optimal parse, with the top signal at the beginning of the parse and the bottom signal at the end. exon_length(p, s) denotes the number of coding nucleotides between p and s. The procedure operates by iteratively following the highest-scoring predecessor link from the current signal, adjusting the current phase as necessary when a trellis link corresponding to a coding feature is traversed.
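A traceback of this kind can be sketched as follows. The `Node` layout is hypothetical: each signal stores, per phase, its best predecessor together with a flag saying whether the link is coding and the number of coding nucleotides on it (standing in for exon_length(p, s)). The phase update rule, subtracting the coding length mod 3 on coding links and leaving the phase unchanged otherwise, is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    pos: int
    # phase -> (best predecessor, link is coding?, coding nucleotides on the link)
    pred: Dict[int, Tuple["Node", bool, int]] = field(default_factory=dict)


def traceback(s: Node, phase: int) -> List[Node]:
    """Follow best-predecessor links; the last list element is the top of the stack."""
    stack = []
    while True:
        stack.append(s)
        link = s.pred.get(phase)
        if link is None:  # no predecessor: the beginning of the parse
            return stack
        s, is_coding, coding_len = link
        if is_coding:
            # ASSUMPTION: phase at the left end of an exon is derived from
            # the phase at its right end and the coding length.
            phase = (phase - coding_len) % 3


# Toy trellis for a two-exon gene with 10 + 8 = 18 coding nucleotides.
atg = Node("ATG", 0)
gt = Node("GT", 10, {1: (atg, True, 10)})
ag = Node("AG", 50, {1: (gt, False, 0)})
tag = Node("TAG", 58, {0: (ag, True, 8)})
parse = [n.name for n in reversed(traceback(tag, 0))]
```

Reading the stack from the top down gives the signals in left-to-right order, which for the toy trellis is ATG, GT, AG, TAG.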
