Genome Res. 2010 Jan;20(1):110-21.
doi: 10.1101/gr.097857.109. Epub 2009 Oct 26.

Detection of Nonneutral Substitution Rates on Mammalian Phylogenies

Free PMC article

Katherine S Pollard et al. Genome Res.

Abstract

Methods for detecting nucleotide substitution rates that are faster or slower than expected under neutral drift are widely used to identify candidate functional elements in genomic sequences. However, most existing methods consider either reductions (conservation) or increases (acceleration) in rate but not both, or assume that selection acts uniformly across the branches of a phylogeny. Here we examine the more general problem of detecting departures from the neutral rate of substitution in either direction, possibly in a clade-specific manner. We consider four statistical, phylogenetic tests for addressing this problem: a likelihood ratio test, a score test, a test based on exact distributions of numbers of substitutions, and the genomic evolutionary rate profiling (GERP) test. All four tests have been implemented in a freely available program called phyloP. Based on extensive simulation experiments, these tests are remarkably similar in statistical power. With 36 mammalian species, they all appear to be capable of fairly good sensitivity with low false-positive rates in detecting strong selection at individual nucleotides, moderate selection in 3-bp elements, and weaker or clade-specific selection in longer elements. By applying phyloP to mammalian multiple alignments from the ENCODE project, we shed light on patterns of conservation/acceleration in known and predicted functional elements, approximate fractions of sites subject to constraint, and differences in clade-specific selection in the primate and glires clades. We also describe new "Conservation" tracks in the UCSC Genome Browser that display both phyloP and phastCons scores for genome-wide alignments of 44 vertebrate species.
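To make the idea of a two-sided rate test concrete, here is a minimal sketch of a likelihood ratio test for a rate scale factor ρ. It is not phyloP's phylogenetic likelihood: it swaps in a toy Poisson model in which the number of substitutions observed in an element has mean ρ times the neutral expectation. The function name `toy_rate_lrt`, its parameters, and the reporting of the result as a signed -log10 p-value are assumptions made for illustration; only the sign convention (positive for conservation, negative for acceleration) follows the CONACC description in the Figure 4 caption.

```python
import math

def toy_rate_lrt(n_subs, expected_neutral):
    """Toy LRT of H0: rho = 1 against H1: rho free (Poisson stand-in model).

    n_subs           -- observed substitution count (hypothetical input)
    expected_neutral -- expected count under the neutral model (must be > 0)
    Returns (rho_hat, statistic, p_value, signed_score).
    """
    if expected_neutral <= 0:
        raise ValueError("expected_neutral must be positive")
    rho_hat = n_subs / expected_neutral  # maximum-likelihood rate scale
    # Log-likelihood at rho_hat minus log-likelihood at rho = 1.
    if n_subs > 0:
        log_ratio = n_subs * math.log(rho_hat) - (n_subs - expected_neutral)
    else:
        log_ratio = expected_neutral
    stat = max(0.0, 2.0 * log_ratio)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    score = -math.log10(p_value) if p_value > 0 else math.inf
    # Assumed sign convention: conservation positive, acceleration negative.
    signed_score = score if rho_hat < 1 else -score
    return rho_hat, stat, p_value, signed_score
```

An element with fewer substitutions than expected gets a positive score and one with more gets a negative score, so a single statistic covers departures in both directions.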

Figures

Figure 1.
Receiver operating characteristic (ROC) curves showing false-positive versus true-positive rates for the all-branch tests implemented in phyloP: (red) LRT, (green) SCORE, (blue) SPH, and (purple) GERP. Individual plots show results for simulated data sets with either 3-bp (top) or 1-bp (bottom) elements generated from models with a range of deviations ρ from the neutral rate ρ = 1.0 (columns).
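As a rough illustration of how a ROC curve of this kind can be traced from simulated data, the sketch below sweeps a score threshold over two sets of scores, one from neutrally simulated elements and one from elements simulated under selection. The function names and the convention that larger scores mean stronger evidence are assumptions for this example, not details taken from the paper.

```python
def roc_points(neutral_scores, selected_scores):
    """Return (false-positive rate, true-positive rate) pairs.

    An element is called "selected" when its score is >= the threshold;
    thresholds are swept from the highest observed score downward.
    """
    thresholds = sorted(set(neutral_scores) | set(selected_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(s >= t for s in neutral_scores) / len(neutral_scores)
        tpr = sum(s >= t for s in selected_scores) / len(selected_scores)
        points.append((fpr, tpr))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of (fpr, tpr) points sorted by fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Perfectly separated score sets give an area of 1, while identical score distributions give an area near 0.5.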
Figure 2.
Estimated FDR for all-branch LRT. Estimates of false discovery rate (FDR) versus true-positive rate (TPR) based on two indirect methods, for 1-bp and 3-bp elements. (CDS2) Average TPRs are estimated from second codon position sites; (mixture) average TPRs are estimated by decomposing the genome-wide score distribution into components corresponding to neutral and selected sites. Details are given in Supplemental section S2.8.
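The relationship between FDR and TPR can be sketched with the standard identity that combines a false-positive rate, a true-positive rate, and the fraction of sites that are truly neutral. The paper's own indirect TPR estimates are described in Supplemental section S2.8 and are not reproduced here; the function name and all three inputs below are hypothetical.

```python
def false_discovery_rate(fpr, tpr, neutral_fraction):
    """Expected FDR at a threshold, assuming a two-class mixture of sites.

    fpr              -- fraction of neutral sites called significant
    tpr              -- fraction of selected sites called significant
    neutral_fraction -- assumed fraction of all sites that are neutral
    """
    false_calls = neutral_fraction * fpr
    true_calls = (1.0 - neutral_fraction) * tpr
    total_calls = false_calls + true_calls
    if total_calls == 0:
        return 0.0  # nothing is called, so no false discoveries
    return false_calls / total_calls
```

Because neutral sites usually far outnumber selected ones, even a small false-positive rate can yield a substantial FDR.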
Figure 3.
Subtree ROC curves. (Left) Phylogenetic tree used in this study, with branch lengths drawn in proportion to the values estimated from 4D sites. Three subtrees are highlighted: (maroon) primates, (gold) glires, and (blue) laurasiatherians. (Right) ROC curves for the LRT (red) and SCORE (green) subtree tests as applied to 3-bp and 10-bp elements under clade-specific selection in the primates (top) and laurasiatherians (bottom). (The SPH method did not perform as well, and the subtree test is not supported with the GERP method.) Results are shown for the case in which ρ = 1.0 and λ = 0.3, meaning that the clade of interest is evolving at approximately one-third the neutral rate, while the rest of the tree is neutrally evolving.
Figure 4.
Distributions of all-branch scores. (A) Cumulative distribution functions (CDFs) for phyloP scores in sites of different annotation classes, based on the LRT method and 36-species multiple alignments for the ENCODE regions. Positive scores indicate conservation, and negative scores indicate acceleration (CONACC mode) (see Methods). Curves are shown for first, second, and third codon positions (CDS1, CDS2, CDS3), 5′ and 3′ UTRs, noncoding RNAs (ncRNAs), predicted transcription factor binding sites (TFBS), conserved elements identified by phastCons, intergenic sites, and ancestral repeats (AR). (See Supplemental Fig. S6 for additional annotation classes.) (B) Average conservation scores as a function of genomic position within 52 predicted NRSF binding sites in the ENCODE regions. Binding sites were predicted at ChIP/chip peaks using the motif from TRANSFAC (FDR = 20%) (Supplemental section S2.9). A sequence logo representation of the motif is shown for comparison. Notice the general correlation between information content and cross-species conservation across the positions of the motif (see Moses et al. 2003). (C) Estimated fractions of sites under selection for each annotation class. Classes include those from A, plus 5′ and 3′ flanking regions of genes, sequence-specific regulatory binding regions (RFBR-Seqsp), putative transcriptional fragments of unknown function (Un.TxFrags), intronic sites, and nonconserved nongenic (NCNG) sites. These are estimates of lower bounds computed by a simple mixture-decomposition method (see Methods) and should be considered approximate. All classes show a highly significant enrichment for conserved sites relative to the AR distribution by a one-sided Mann-Whitney U test (P ≈ 0) except the 3′ flank, intronic, Un.TxFrags, and NCNG categories (all P ≈ 1).
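The one-sided Mann-Whitney U comparison against the AR distribution can be sketched as follows. This is a generic implementation that uses the normal approximation with tie and continuity corrections, not the code used in the paper; the function name and arguments are assumptions for illustration.

```python
import math

def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test that values in x tend to exceed those in y.

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at sorted positions i..j-1 share the average rank.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0  # all values tied, so there is no evidence either way
    z = (u - mean - 0.5) / math.sqrt(var)  # continuity correction
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p_value
```

With a one-sided alternative, a class shifted toward higher scores than AR gives a p-value near 0 and a class shifted the other way gives a p-value near 1, which matches the pattern of P ≈ 0 and P ≈ 1 reported in the caption.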
Figure 5.
Distributions of subtree scores for the primate and glires clades. Cumulative distribution functions (CDFs) of scores for selected annotation classes as computed by the subtree test for the primate (A) and glires (B) clades. As in previous figures, CONACC scores computed by the LRT method are shown, but in this case, scores are computed in a 10-bp sliding window. In both figures most distributions are significantly different from the AR distribution by a two-sided Mann-Whitney U test even when the curves appear very similar, because the data sets are generally quite large (exceptions are phastCons and TFBS in A and 5′ flank and TFBS in B).
Figure 6.
Conservation track in UCSC Genome Browser. A portion of the desmoglein 1 (DSG1) gene on human chromosome 18 shown with the new Conservation track, including a 44-way vertebrate alignment and nine conservation subtracks. The subtracks display phyloP scores (in blue and red), phastCons scores (green), and phastCons-predicted conserved elements (pink, purple, and mustard) for all species, the 32 placental mammals, and the nine primates (bottom to top within each group). (A) The phyloP and phastCons scores are broadly similar when the display is zoomed out, with scores near zero for most noncoding regions but elevated in exons (thick blue bars at top) as well as in conserved noncoding elements (orange arrow). (B) At finer resolution, however, phyloP reveals significantly more variation from base to base than does the hidden Markov model–based phastCons. In this coding exon, codon position effects are clearly evident from phyloP but not from phastCons. (C,D) The phyloP tracks also indicate accelerated evolution (with negative scores, shown in red), while phastCons measures conservation only. Here an exon with a striking fast-evolving segment is shown. Interestingly, cDNA data from other mammals suggest that this exon derives from a fusion of two ancestral exons, with the fast-evolving segment corresponding to the ancestral intron.
