Database (Oxford). 2016 Dec 26;2016:baw139.
doi: 10.1093/database/baw139. Print 2016.

Minimizing Proteome Redundancy in the UniProt Knowledgebase

Free PMC article

Borisas Bursteinas et al. Database (Oxford). 2016.

Abstract

Advances in high-throughput sequencing have led to unprecedented growth in the number of genome sequences submitted to biological databases. In particular, the sequencing of large numbers of nearly identical bacterial genomes during infection outbreaks and for other large-scale studies has resulted in a high level of redundancy in nucleotide databases and, consequently, in the UniProt Knowledgebase (UniProtKB). Redundancy degrades database searches: it slows them down, increases statistical bias and makes result analysis cumbersome. Combined with the large data volume, redundancy also increases the computational cost of most reuses of UniProtKB data. All of this poses challenges for effective discovery in this wealth of data. With the continuing development of sequencing technologies, it is clear that finding ways to minimize redundancy is crucial to maintaining UniProt's essential contribution to data interpretation by our users. We have developed a methodology to identify and remove highly redundant proteomes from UniProtKB. The procedure identifies redundant proteomes by performing pairwise alignments of sets of sequences for pairs of proteomes and subsequently applies graph theory to find dominating sets that provide a set of non-redundant proteomes with a minimal loss of information. This method was implemented for bacteria in mid-2015, resulting in the removal of 50 million proteins from UniProtKB. With every new release, this procedure is used to filter new incoming proteomes, resulting in a more scalable and scientifically valuable growth of UniProtKB.

Database URL: http://www.uniprot.org/proteomes/
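
The graph step described in the abstract can be illustrated with a minimal sketch. It assumes that pairwise proteome similarity scores have already been computed, links two proteomes when their score reaches a threshold (0.9 here, mirroring the 90% figure quoted in the Figure 4 caption), and then picks a dominating set with a simple greedy heuristic. The function names, the proteome labels and the scores are hypothetical, and the greedy heuristic is a common approximation chosen for illustration; the abstract does not say which dominating-set algorithm the authors use.

```python
from collections import defaultdict


def build_redundancy_graph(similarities, threshold=0.9):
    """Build an undirected graph with an edge between every pair of
    proteomes whose similarity score is at or above the threshold.

    `similarities` maps (proteome_a, proteome_b) -> score in [0, 1].
    The input format is an assumption made for this sketch.
    """
    graph = defaultdict(set)
    for (a, b), score in similarities.items():
        # Touch both keys so isolated proteomes still appear as nodes.
        graph[a]
        graph[b]
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def greedy_dominating_set(graph):
    """Greedy approximation of a dominating set (not guaranteed minimum).

    Repeatedly keep the node that covers the most not-yet-dominated
    nodes, counting the node itself and its neighbours. Ties are broken
    by sorted node name so the result is deterministic.
    """
    undominated = set(graph)
    chosen = set()
    while undominated:
        best = max(
            sorted(graph),
            key=lambda n: len(({n} | graph[n]) & undominated),
        )
        chosen.add(best)
        undominated -= {best} | graph[best]
    return chosen


# Made-up scores for five hypothetical proteomes.
similarities = {
    ("P1", "P2"): 0.97,
    ("P1", "P3"): 0.95,
    ("P2", "P3"): 0.60,
    ("P3", "P4"): 0.92,
    ("P4", "P5"): 0.40,
}

graph = build_redundancy_graph(similarities)
kept = greedy_dominating_set(graph)   # proteomes retained
redundant = set(graph) - kept         # proteomes flagged as redundant
```

Every proteome outside `kept` is adjacent to at least one retained proteome, which is the property that lets a redundant proteome be represented by a similar non-redundant one.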

Figures

Figure 1.
The workflow. The proteome redundancy pipeline consists of two modules: (A) Comparison and (B) Redundancy removal. Each proteome is compared with every other proteome in its group. The results of the Comparison module form the input for the Redundancy removal module.
Figure 2.
Redundancy graph reduction example. The Redundancy removal module generates redundancy graphs and finds a dominating set for each graph.
Figure 3.
Taxonomic tree of proteomes for complete taxonomic space: before and after PRM. Proteomes were mapped before and after PRM, taking into account the taxonomic relationships between the proteomes (panels A and B). Each node corresponds to a proteome, while a cluster represents a species grouping. The cluster size indicates the number of proteomes available for a particular species. The effect of PRM is clearly visible in the finer-grained panel on the right (panel B).
Figure 4.
Compared proteomes. Nearly 16 663 proteomes (60% of the total) appeared in above-threshold comparisons, i.e. each was at least 90% similar to another proteome of the same taxonomic group.
Figure 5.
Percentages of redundant proteomes. The distribution of redundant proteomes is shown as a fraction of the Reference proteomes (13.65%) and of all proteomes (58.2%), respectively.
Figure 6.
UniProt Proteome search results. Querying by species names, taxonomy identifiers, organism codes as well as the newly introduced proteome identifiers is supported. When querying by species, one or more Reference proteomes are indicated at the top of the list followed by any other proteomes available. Redundant proteomes appear in grey and users are directed to an alternate non-redundant proteome available for the species.
Figure 7.
Proteome page with the download window open. This figure shows the proteome page for a redundant proteome with the download window open. The sequences in UniParc can be downloaded in a range of formats from that window.
