, 38 (Database issue), D161-6

PROSITE, a Protein Domain Database for Functional Characterization and Annotation

Christian J A Sigrist et al. Nucleic Acids Res.

Abstract

PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is widely used for the annotation of domain features of UniProtKB/Swiss-Prot entries. Among the 983 (DNA-binding) domains, repeats and zinc fingers present in Swiss-Prot (release 57.8 of 22 September 2009), 696 (approximately 70%) are annotated with PROSITE descriptors using information from ProRule. To allow better functional characterization of domains, PROSITE developments focus on subfamily-specific profiles and a new profile-building method that gives more weight to functionally important residues. Here, we describe AMSA, an annotated multiple sequence alignment format used to build a new generation of generalized profiles; the migration of ScanProsite to Vital-IT, a cluster of 633 CPUs; and the adoption of the Distributed Annotation System (DAS) to facilitate PROSITE data integration and interchange with other sources. The latest version of PROSITE (release 20.54, of 22 September 2009) contains 1308 patterns, 863 profiles and 869 ProRules. PROSITE is accessible at: http://www.expasy.org/prosite/.

Figures

Figure 1.
(A) AMSA 1.0 grammar in Extended Backus-Naur Form. Explanation of the grammar symbols: ‘=’ assign operator; ‘()’ grouping operator; ‘[]’ optional (0 or 1 occurrence); ‘{}’ repetition (0 or more occurrences); ‘-’ exclude following symbol; ‘|’ alternation (or); ‘??’ special sequence. Note that any visible symbol is accepted for sequence and annotation alphabets, except the special symbols ‘#’, ‘=’, ‘∼’ and ‘>’. (B) Recommendations for the AMSA 1.0 format. An AMSA file consists of aligned sequences and annotation represented in FASTA format. It is possible to precede the sequence and annotation block with a version header followed by general comments (lines starting with ‘#’) and global annotation (lines starting with ‘##’ followed by a key = text); in this case, the sequence and annotation block must be terminated with the symbols ‘#//’. Any ad hoc alphabet can be used to describe sequences and annotation. The meaning of the alphabet used by an annotation layer can be explicitly represented as pairs value ∼ symbol in its description field. Sequence- (or annotation)-specific annotation is added after the ID field using key = value (or key = ‘text with white-spaces’). Sequence residues are annotated using a cross-reference to an annotation layer (key = # residue_annotation_ID). Though not required by the specification, we suggest including the reference sequence ID in the residue_annotation_ID to facilitate the visual mapping from annotation to sequence. Columns of the MSA are annotated using annotation layers not linked to one or multiple sequences (see Figure 2 for an example). Two keywords are reserved in the current format: ‘DEFAULT’ sets a default value for the ‘-’ symbol of the annotation sequence; ‘SYM_LEN’ defines the number of characters required to encode a symbol. Setting SYM_LEN = digits extends the format of sequences and annotations to any alphabet, e.g. codons (SYM_LEN = 3) or PDB coordinates (SYM_LEN = 8, to store up to 6 digits, a dot and a sign). If SYM_LEN is not specified, its default value should be 1 (one character per symbol). Note that the same result can be achieved by encoding multi-character symbols on a corresponding number of layers, which has the advantage that standard MSA editors can accept the format.
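The recommendations in this caption can be illustrated with a minimal reader, sketched here in Python. It is not the authors' implementation and does not follow the full EBNF grammar of panel A, which is not reproduced in this text. The function names (`parse_header`, `split_symbols`, `read_amsa`), the record IDs and the sample file are hypothetical, and the sketch assumes ASCII single quotes around quoted text and one header line per record.

```python
import re

# Assumed header layout: ">ID key = value key = 'text with white-spaces' ..."
_PAIR = re.compile(r"(\w+)\s*=\s*('[^']*'|\S+)")


def parse_header(line):
    """Split a FASTA-style header into its ID and its key = value pairs."""
    parts = line[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("header line without an ID")
    rest = parts[1] if len(parts) > 1 else ""
    attrs = {key: value.strip("'") for key, value in _PAIR.findall(rest)}
    return parts[0], attrs


def split_symbols(residues, sym_len=1):
    """Cut a residue string into symbols of SYM_LEN characters each."""
    if sym_len < 1 or len(residues) % sym_len:
        raise ValueError("length is not a multiple of SYM_LEN")
    return [residues[i:i + sym_len] for i in range(0, len(residues), sym_len)]


def read_amsa(text):
    """Return (sequences, layers); layer IDs are assumed to start with '#'."""
    sequences, layers = {}, {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "#//":          # terminator of the sequence and annotation block
            break
        if line.startswith("#"):   # comments and global annotation: skipped here
            continue
        if line.startswith(">"):
            rec_id, attrs = parse_header(line)
            current = {"attrs": attrs, "data": ""}
            (layers if rec_id.startswith("#") else sequences)[rec_id] = current
        elif current is not None:
            current["data"] += line
    for record in list(sequences.values()) + list(layers.values()):
        sym_len = int(record["attrs"].get("SYM_LEN", 1))  # default is 1
        record["symbols"] = split_symbols(record["data"], sym_len)
    return sequences, layers


# Hypothetical sample file, not taken from the paper.
SAMPLE = """\
# general comment
>SEQ_A weight = 0.5 ss_layer = #_SS_SEQ_A
AC-DE
>SEQ_B weight = 0.5
ACGDE
>#_SS_SEQ_A DEFAULT = coil
HH-HH
>#_CODONS SYM_LEN = 3
ATGGCC---GATGAA
#//
"""

sequences, layers = read_amsa(SAMPLE)
```

Because the special symbols ‘#’ and ‘>’ are excluded from sequence and annotation alphabets, the sketch can classify every line by its first character alone.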
Figure 2.
(A) Example of an AMSA. Protein sequences containing symbols from the amino acid alphabet and ‘-’ to represent gaps are represented in the standard MSA format at the beginning of the file. Annotation layers (layers beginning with ‘#’) are in the second part of the AMSA file and they contain any ad hoc alphabet. The meaning of each symbol of the alphabet is described by the symbol ∼ value pairs in the description field of each layer. Note that the symbol ‘-’ in annotation denotes a default value described by the DEFAULT key. Annotation attached to the individual sequences is represented as key = value pairs (e.g. weight of the sequence). Annotation attached to individual residues of a sequence is described in a cross-referenced layer: in the example the ‘ss_layer’ cross-reference of THM1_THADA points to the secondary structure of the protein in layer #_SS_THM1_THADA. MSA column annotation is used to label disulfide bridges (#_SITE_DISULFIDE) where numbers represent the cysteine couple and values represent the expected symbol (C for cysteine in this example). The remaining layers represent parameters of the profile: the topology of the model (#_LABEL_), the two matrices used to generate pseudocounts (layers #_MATRIX_ and #_MATRIX_2) with the respective position-specific weights of the pseudocounts. (B) The same alignment viewed in Jalview.
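How an annotation layer might be decoded can be sketched as follows, again in Python and only as an illustration. The sketch assumes that the description field lists symbol ∼ value pairs separated by white-space (accepting either ‘∼’ or an ASCII ‘~’), that values contain no white-space, and that DEFAULT names either a value or a symbol defined in the same description. The function names and the example layer are hypothetical and are not taken from the figure.

```python
import re


def parse_layer_description(description):
    """Return (symbol -> value legend, DEFAULT value or None)."""
    legend = dict(re.findall(r"(\S+)\s*[~∼]\s*(\S+)", description))
    match = re.search(r"\bDEFAULT\s*=\s*(\S+)", description)
    default = match.group(1) if match else None
    # Assumption: DEFAULT may itself be a symbol from the legend.
    return legend, legend.get(default, default)


def decode_layer(symbols, description):
    """Translate layer symbols to values; '-' takes the DEFAULT value."""
    legend, default = parse_layer_description(description)
    return [default if s == "-" else legend.get(s, s) for s in symbols]


def annotate_residues(sequence, layer, description):
    """Pair each non-gap residue with the value in the same alignment column."""
    values = decode_layer(layer, description)
    return [(res, val) for res, val in zip(sequence, values) if res != "-"]


# Hypothetical secondary-structure layer cross-referenced by one sequence.
DESCRIPTION = "H ~ helix E ~ strand DEFAULT = coil"
decoded = decode_layer("HH-EE", DESCRIPTION)
paired = annotate_residues("AC-DE", "HH-EE", DESCRIPTION)
```

Keeping the legend inside the description field means a layer can be interpreted without any external dictionary, which is the point of the symbol ∼ value pairs described in the caption.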
