J Comput Biol. 2009 Jan;16(1):1-18.
doi: 10.1089/cmb.2008.0137.

Exact Calculation of Distributions on Integers, With Application to Sequence Alignment

Free PMC article

Lee A Newberg et al. J Comput Biol.

Abstract

Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, the score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
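
As a rough illustration of what "exact distribution of an integer score" can mean, the following Python sketch computes the distribution of a motif-style score by a simple dynamic programming recursion over positions. It is not one of the three algorithms of the paper, which the abstract does not describe. The i.i.d. background model, the toy scoring table, and the names `score_distribution` and `tail_probability` are all assumptions made for this example.

```python
from fractions import Fraction


def score_distribution(position_scores, background):
    """Exact distribution of the total integer score of a random sequence.

    position_scores: one dict per position, mapping letter -> integer score
                     (a hypothetical scoring table).
    background:      dict mapping letter -> probability; letters are assumed
                     independent and identically distributed.
    Returns a dict mapping total score -> exact probability.
    """
    dist = {0: Fraction(1)}  # before any position, the score is 0 with certainty
    for scores in position_scores:
        updated = {}
        for total, p in dist.items():
            for letter, q in background.items():
                key = total + scores[letter]
                updated[key] = updated.get(key, Fraction(0)) + p * q
        dist = updated
    return dist


def tail_probability(dist, threshold):
    """P(score >= threshold), i.e. a p-value under the background model."""
    return sum(p for s, p in dist.items() if s >= threshold)


# Toy two-letter alphabet and two-position scoring table (illustrative only).
background = {"A": Fraction(1, 2), "C": Fraction(1, 2)}
position_scores = [{"A": 1, "C": -1}, {"A": 2, "C": 0}]
dist = score_distribution(position_scores, background)
```

Because the score takes integer values, the table `dist` stays small (one entry per attainable score), and using `Fraction` keeps every probability exact rather than approximate.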

Figures

FIG. 1.
The exact credibility distributions, the distribution of the pairing distances of the posterior weighted ensemble of alignments from the centroid alignment and from the maximum-score alignment, for the eighth pair of sequences in the human-rodent data set. The higher peak, on the left, is for the centroid alignment; the lower peak is for the maximum-score alignment. This example clearly demonstrates the rule that the average pairing distance from the centroid alignment is less than the average pairing distance from the maximum-score alignment. Also evident is the common phenomenon of odd-even alternation; in some circumstances, the breaking of a pairing present in an estimating alignment frees up a nucleotide for participation in another advantageous pairing; thus, differences from an estimating alignment are encouraged to arise in pairs. The human nucleotide sequence with repeats removed is of length 1769, and likewise the rodent sequence is of length 1575. The centroid alignment has 1099 pairings, and the maximum-score alignment has 1123 pairings.
FIG. 2.
The exact credibility distributions for the second pair of sequences in the human-rodent data set. The distribution for the centroid alignment is slightly left of that for the maximum-score alignment. This example demonstrates the occurrence of multiple peaks; often there are multiple distinct clusters of good-quality alignments, and an estimating alignment can fall within only one of them. The human nucleotide sequence with repeats removed is of length 1691, and likewise the rodent sequence is of length 2219. The centroid alignment has 205 pairings, and the maximum-score alignment has 214 pairings.
FIG. 3.
The exact credibility distributions for the fourth pair of sequences in the human-rodent data set. The distribution for the centroid alignment is strongly to the left of that for the maximum-score alignment. The order of the peaks for the two estimating alignments is inverted, indicating that the centroid and maximum-score alignments belong to distinct clusters of high-quality alignments; the centroid alignment falls within the larger cluster. The human nucleotide sequence with repeats removed is of length 1677, and likewise the rodent sequence is of length 1666. The centroid alignment has 438 pairings, and the maximum-score alignment has 450 pairings.
FIG. 4.
For pairwise alignments of D. melanogaster sequences with each of the orthologous sequences in D. pseudo-obscura, D. erecta, D. yakuba, and D. simulans, for each of 20 intergenic regions—80 plotted points in total—we have plotted the 95% credibility limit for the centroid alignment (x coordinate) and the 95% credibility limit for the maximum-score alignment (y coordinate). The sequence lengths varied from 139 to 1000 with a mean of 634. The fact that most points lie above the y = x diagonal indicates that 95% credibility limits (i.e., error bars) are tighter for the centroid alignments. We see no apparent pattern by which to differentiate the species-specific sets of points.
FIG. 5.
Relative credibility versus significance. For each of the 80 pairwise alignments of fly sequence data, we have plotted the relative 95% credibility limit (the ratio of the 95% credibility limit for the maximum-score alignment to the number of pairings in the maximum-score alignment) versus statistical significance, as measured by p-value. In the first panel, with p-value of ≥ 10^-7, the relative credibility values are widely variable and exhibit little or no relationship to p-values. In the second panel, all remaining points are plotted; there is a downward trend of mean relative credibility with diminishing p-value. However, near any given p-value there is high variability in the relative credibility among the alignments; and this is true even for the extremely significant p-values of ≤ 10^-100. We conclude that it is unreasonable to assume that an alignment is credible based simply upon the fact that its p-value is strong.
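
The credibility limits plotted in Figs. 4 and 5 can be illustrated with a short Python sketch. It assumes that the 95% credibility limit is the smallest pairing distance d such that at least 95% of the posterior probability lies within distance d of the estimating alignment; this is one plausible reading of the captions, not a definition quoted from the paper. The function names and the toy distribution are hypothetical.

```python
def credibility_limit(distance_probs, level=0.95):
    """Smallest distance d with P(distance <= d) >= level.

    distance_probs[d] is the posterior probability that an alignment drawn
    from the ensemble lies at pairing distance d from the estimating alignment.
    """
    cumulative = 0
    for d, p in enumerate(distance_probs):
        cumulative += p
        if cumulative >= level:
            return d
    raise ValueError("probabilities sum to less than the requested level")


def relative_credibility(distance_probs, n_pairings, level=0.95):
    """Credibility limit divided by the number of pairings in the alignment,
    following the ratio described in the Fig. 5 caption."""
    return credibility_limit(distance_probs, level) / n_pairings


# Toy distribution over distances 0..3 (illustrative values, not data from the paper).
toy_probs = [0.5, 0.25, 0.125, 0.125]
limit = credibility_limit(toy_probs)
ratio = relative_credibility(toy_probs, n_pairings=12)
```

Under this reading, a smaller limit means the posterior ensemble is concentrated closer to the estimating alignment, which is the sense in which the Fig. 4 caption describes the centroid error bars as tighter.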
