Bioinformatics. 2017 Aug 15;33(16):2471-2478.
doi: 10.1093/bioinformatics/btx221.

Domain Prediction With Probabilistic Directional Context

Free PMC article

Alejandro Ochoa et al. Bioinformatics. 2017.

Abstract

Motivation: Protein domain prediction is one of the most powerful approaches for sequence-based function prediction. Although domain instances are typically predicted independently of each other, newer approaches have demonstrated improved performance by rewarding domain pairs that frequently co-occur within sequences. However, most of these approaches have ignored the order in which domains preferentially co-occur and have also not modeled domain co-occurrence probabilistically.

Results: We introduce a probabilistic approach for domain prediction that models 'directional' domain context. Our method is the first to score all domain pairs within a sequence while taking their order into account, even for non-sequential domains. We show that our approach extends a previous Markov model-based approach to additionally score all pairwise terms, and that it can be interpreted within the context of Markov random fields. We formulate our underlying combinatorial optimization problem as an integer linear program, and demonstrate that it can be solved quickly in practice. Finally, we perform extensive evaluation of domain context methods and demonstrate that incorporating context increases the number of domain predictions by ∼15%, with our approach dPUC2 (Domain Prediction Using Context) outperforming all competing approaches.
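
To make the idea of 'directional' context concrete, here is a minimal sketch; it is not the dPUC2 implementation. It assumes that each candidate domain carries a family name, a start position and an individual score, and that every ordered pair of families (upstream, downstream) has a context score. The candidates, the scores, the default penalty and the function names are all hypothetical. The abstract states that the selection problem is solved as an integer linear program; exhaustive enumeration stands in for it here only because the toy input is tiny.

```python
from itertools import combinations

# Hypothetical candidate domains: (family, start position, individual score).
candidates = [
    ("A", 1, 5.0),
    ("B", 120, -1.0),
    ("C", 210, -2.0),
]

# Hypothetical directional context scores keyed by (upstream, downstream).
# ("A", "B") and ("B", "A") differ, which is what makes the context directional.
context = {("A", "B"): 3.0, ("B", "A"): -4.0, ("A", "C"): 0.5}
DEFAULT_CONTEXT = -1.0  # assumed penalty for ordered pairs with no entry


def total_score(domains):
    """Sum individual scores plus one context term per ordered pair,
    including pairs that are not adjacent in the sequence."""
    ordered = sorted(domains, key=lambda d: d[1])
    score = sum(d[2] for d in ordered)
    for upstream, downstream in combinations(ordered, 2):
        score += context.get((upstream[0], downstream[0]), DEFAULT_CONTEXT)
    return score


def best_subset(cands):
    """Brute-force search over all non-empty subsets (toy inputs only)."""
    best, best_score = (), 0.0
    for k in range(1, len(cands) + 1):
        for subset in combinations(cands, k):
            s = total_score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


chosen, score = best_subset(candidates)
print([d[0] for d in chosen], score)  # ['A', 'B'] 7.0
```

In this toy input, family B has a negative individual score but is kept because it follows A, a favourable ordered pair. Placing B upstream of A instead would incur the (B, A) penalty.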

Availability and implementation: dPUC2 is available at http://github.com/alexviiia/dpuc2.

Contact: mona@cs.princeton.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1
(A) Geometric view of the relaxed LP constraints of Equations (3–5). All x_i, x_j and x_{i,j} are constrained by Equation (2) to the unit cube shown. Equation colors match their translucent surfaces, and the blue surface is behind the others. The vertices of the polytope resulting from the intersection of these three constraints and the cube arise at integral values of x_i, x_j and x_{i,j} (green crosses). (B) Previously, we described another ILP (Ochoa et al., 2011) that consisted of two constraints relating x_i, x_j and x_{i,j}; that formulation is weaker than our current one, as its polytope (the space between the two planes and inside the unit cube) is larger and has undesirable fractional vertices (red crosses) along with integral vertices (green crosses).
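
Equations (2–5) are not reproduced on this page, so the following sketch rests on assumptions. It takes the three constraints to be the textbook linearization of a product of binary variables (x_ij <= x_i, x_ij <= x_j, x_ij >= x_i + x_j - 1), and takes the weaker two-constraint formulation to be the aggregated form (2 x_ij <= x_i + x_j, x_ij >= x_i + x_j - 1). Neither set is confirmed as the paper's exact formulation. Under those assumptions, the code checks that both formulations admit the same integral points, while only the weaker one admits a fractional point such as (1, 0, 0.5).

```python
from itertools import product

EPS = 1e-9  # numerical tolerance for the inequality checks


def in_unit_cube(*xs):
    return all(-EPS <= x <= 1 + EPS for x in xs)


def strong_ok(xi, xj, xij):
    """Assumed three-constraint linearization of xij = xi * xj."""
    return (in_unit_cube(xi, xj, xij)
            and xij <= xi + EPS
            and xij <= xj + EPS
            and xij >= xi + xj - 1 - EPS)


def weak_ok(xi, xj, xij):
    """Assumed two-constraint (aggregated) formulation."""
    return (in_unit_cube(xi, xj, xij)
            and 2 * xij <= xi + xj + EPS
            and xij >= xi + xj - 1 - EPS)


# Integral points (xi, xj, xij) allowed by each formulation.
binary_points = list(product((0, 1), repeat=3))
strong_integral = [p for p in binary_points if strong_ok(*p)]
weak_integral = [p for p in binary_points if weak_ok(*p)]

# A fractional point that only the weaker relaxation allows.
fractional = (1, 0, 0.5)

print(strong_integral == weak_integral)  # True
print(weak_ok(*fractional), strong_ok(*fractional))  # True False
```

Both assumed formulations enforce x_ij = x_i * x_j at integral points, so they describe the same integer program; they differ only in how tightly the LP relaxation wraps those points.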
Fig. 2
Illustration of FDR tests. (A) RevSeq and MarkovR pair a real sequence (green line) with a random sequence of the same length (reversed or Markov sequence; red line). Methods select domains pooled from both sequences (black line); real sequence domains (boxes 1 and 2) are TPs, while random sequence domains (boxes 3 and 4) are FPs. (B) OrthoC labels domains as TPs if there are domains of the same clan with P < 1e-4 in orthologs (connected by green edges: boxes 1, 5; 2, 4, 7), and as FPs otherwise (the clans of 3 and 6 are not in orthologs).
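
The labelling in panel (A) can be sketched as follows, assuming the usual definition FDR = FP / (TP + FP). The exact estimator used in the paper is not given on this page and may differ. The prediction records and the `source` field are hypothetical.

```python
def estimate_fdr(predictions):
    """Fraction of selected domains that fall on the random sequence.

    Each prediction is a dict whose "source" is "real" (counted as a TP)
    or "random" (counted as an FP), as in a RevSeq- or MarkovR-style test.
    """
    tp = sum(1 for p in predictions if p["source"] == "real")
    fp = sum(1 for p in predictions if p["source"] == "random")
    total = tp + fp
    return fp / total if total else 0.0


# Hypothetical output of a method run on the pooled candidates.
selected = [
    {"family": "PF_a", "source": "real"},
    {"family": "PF_b", "source": "real"},
    {"family": "PF_c", "source": "real"},
    {"family": "PF_d", "source": "random"},
]
print(estimate_fdr(selected))  # 0.25
```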
Fig. 3
Our new approach, dPUC2, predicts more domains than its competitors across a wide range of FDRs, as estimated by the RevSeq (left), MarkovR (middle) and OrthoC (right) tests. The dark gray cross gives the FDR of the Standard Pfam, and the changes in the number of domain predictions for all methods at different FDRs are given relative to the number of Standard Pfam predictions. For reference, we highlight with crosses the performances of dPUC2, dPUC1, CODD and DAMA when run on candidate domains identified with HMMER at P < 1e-4.
Fig. 4
Distribution of wall-clock runtimes on human protein sequences using dPUC1 or dPUC2 on a 3.2 GHz processor. A candidate domain threshold of P < 1e-4 was used in both cases. Density is for the log of the runtime.
