Review
FEBS J. 2005 Oct;272(20):5101-9.
doi: 10.1111/j.1742-4658.2005.04945.x.

Protein Database Searches Using Compositionally Adjusted Substitution Matrices

Free PMC article

Stephen F Altschul et al. FEBS J. 2005.

Abstract

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions. Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.

Figures

Figure 1. ROCn curves for the aravind103 and astral40 data sets using standard BLOSUM-62 and conditionally compositionally adjusted BLOSUM-62
The BLAST program [25, 26, 29] was used to compare the test query sets to the test databases, with database sequences filtered of low-complexity segments using the SEG program [36] with parameters (10, 1.8, 2.1). Search results were pooled and ranked by E-value, and ROCn curves [29, 34] were obtained by plotting true positives versus false positives for increasing E-values. For each test set, local alignment scores [9] were calculated using BLOSUM-62 substitution scores [13] and affine gap costs [40, 41]. Composition-based statistics [29] were employed in order to obtain accurate E-values. Specifically, for sufficiently high-scoring alignments, the BLOSUM-62 substitution scores were scaled to have an ungapped λ [10] of 0.006352 in the context of the two sequences being compared, and were used in conjunction with scores of -550 - 50k for a gap of length k. Gapped statistical parameters have been estimated for this scoring system using random simulation [42] and scaling arguments [26, 29]. Also, for each test set, a second run was performed with conditionally compositionally adjusted BLOSUM-62 substitution scores, constrained to have a relative entropy of 0.44 nats in the context of the two sequences being compared (mode C). (a) The aravind103 test set was compared to a yeast protein sequence database that had been edited to remove extra copies of highly similar sequences [29]. (b) A subset of 3586 sequences from the astral40 data set [30, 31] was used as queries against astral40; all self-comparisons were excluded.
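
The pooling-and-ranking step in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `roc_n_curve` and `gap_score`, the `(e_value, is_true_positive)` input format, and the handling of tied E-values (input order is preserved) are all assumptions made for illustration. Only the gap score -550 - 50k is taken from the caption.

```python
from typing import Iterable, List, Tuple


def roc_n_curve(hits: Iterable[Tuple[float, bool]], n: int) -> List[Tuple[int, int]]:
    """Return (false_positives, true_positives) points for pooled hits.

    Hits are ranked by increasing E-value, and the curve is truncated
    once n false positives have been seen (hypothetical helper).
    """
    # Stable sort: hits with equal E-values keep their input order (assumption).
    ranked = sorted(hits, key=lambda hit: hit[0])
    points = [(0, 0)]
    false_pos = 0
    true_pos = 0
    for _e_value, is_true in ranked:
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
        if false_pos >= n:
            break
    return points


def gap_score(k: int) -> int:
    """Scaled affine gap score from the caption: -550 - 50k for a gap of length k."""
    return -550 - 50 * k


if __name__ == "__main__":
    # Made-up pooled hits: (E-value, whether the pair is truly related).
    pooled = [(1e-30, True), (1e-10, True), (0.5, False),
              (1e-3, True), (2.0, False), (5.0, True)]
    for fp, tp in roc_n_curve(pooled, n=2):
        print(fp, tp)
```

Plotting the second coordinate against the first gives a curve of the kind shown in the figure; a better-performing scoring system accumulates more true positives before reaching the n-th false positive.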

Comment in

  • Identifying protein interactions.
    Appella E, Anderson CW. FEBS J. 2005 Oct;272(20):5099-100. doi: 10.1111/j.1742-4658.2005.04944.x. PMID: 16218943. No abstract available.
