PLoS One. 2013 Oct 28;8(10):e77302.
doi: 10.1371/journal.pone.0077302. eCollection 2013.

PathogenFinder--distinguishing Friend From Foe Using Bacterial Whole Genome Sequence Data

Free PMC article

Salvatore Cosentino et al. PLoS One.

Erratum in

  • PLoS One. 2013;8(12). doi:10.1371/annotation/b84e1af7-c127-45c3-be22-76abd977600f


Although the majority of bacteria are harmless or even beneficial to their host, others are highly virulent and can cause serious disease and even death. Due to the steadily decreasing cost of high-throughput sequencing, many completely sequenced genomes are now available from both human pathogenic and innocuous strains. These data can be used to identify gene families that correlate with pathogenicity and to develop tools that predict the pathogenicity of newly sequenced strains, investigations that were previously carried out mainly with more expensive and time-consuming experimental approaches. We describe PathogenFinder, a web-server that predicts bacterial pathogenicity by analysing the proteome, genome, or raw reads provided by the user. The method relies on groups of proteins created without regard to their annotated function or known involvement in pathogenicity. It has been built to work with all taxonomic groups of bacteria and, using the entire training-set, achieved an accuracy of 88.6% on an independent test-set by correctly classifying 398 out of 449 completely sequenced bacteria. Because the proposed approach is not biased towards sets of genes already known to be associated with pathogenicity, it could aid the discovery of novel pathogenicity factors. Furthermore, the pathogenicity prediction web-server could be used to isolate the potential pathogenic features of both known and unknown strains.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.


Figure 1. Pratio and Z-score histograms for TM-Betaproteobacteria model.
The model was built setting MinOrg = 2, HT = 0.9 and LT = 0.3. (A) and (B) show, respectively, the Pratio and Z-score histograms for the clusters i such that ORG_i ≥ MinOrg. This step reduces the original 69,744 clusters to 26,706. In (A) the bars at the extremes count the clusters containing only genes from pathogenic organisms (right bar) or only genes from non-pathogenic ones (left bar), while the small peak in the middle corresponds to clusters containing the same number of pathogenic and non-pathogenic organisms; these will not be used, since they provide no discriminative information about pathogenicity. (C) and (D) show the same histograms for the PFs obtained by removing all the significant clusters with a Pratio value between LT and HT. Panel (C) shows that the number of non-pathogenic PFs is higher than the number of pathogenic ones. HT and LT can be used to adjust the number of both pathogenic and non-pathogenic PFs, which can be useful in models whose training-set has an unbalanced number of pathogenic and non-pathogenic organisms. In (D) the negative Z-scores are associated with non-pathogenic families, while the others are associated with pathogenic PFs.
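The cluster filtering described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Pratio is the fraction of pathogenic organisms in a cluster (consistent with the caption, where all-pathogenic clusters sit at the right extreme and balanced clusters in the middle), treats the LT and HT boundaries as inclusive, and omits the Z-score significance test entirely. The names `Cluster` and `select_families` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Cluster:
    name: str
    pathogenic_orgs: int      # organisms in the cluster labelled pathogenic
    non_pathogenic_orgs: int  # organisms in the cluster labelled non-pathogenic

    @property
    def n_orgs(self) -> int:
        # Plays the role of ORG_i in the caption.
        return self.pathogenic_orgs + self.non_pathogenic_orgs

    @property
    def pratio(self) -> float:
        # ASSUMPTION: Pratio = pathogenic organisms / all organisms in the cluster.
        return self.pathogenic_orgs / self.n_orgs


def select_families(
    clusters: Iterable[Cluster],
    min_org: int = 2,
    ht: float = 0.9,
    lt: float = 0.3,
) -> Tuple[List[Cluster], List[Cluster]]:
    """Split clusters into (pathogenic, non-pathogenic) families.

    Clusters with fewer than min_org organisms are dropped, as are clusters
    whose Pratio lies strictly between lt and ht. Requires min_org >= 1.
    """
    pathogenic: List[Cluster] = []
    non_pathogenic: List[Cluster] = []
    for c in clusters:
        if c.n_orgs < min_org:
            continue  # too few organisms to be informative
        if c.pratio >= ht:
            pathogenic.append(c)
        elif c.pratio <= lt:
            non_pathogenic.append(c)
        # otherwise: Pratio between LT and HT, cluster is discarded
    return pathogenic, non_pathogenic


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = [
        Cluster("a", 5, 0),   # only pathogenic organisms
        Cluster("b", 1, 0),   # below MinOrg
        Cluster("c", 2, 2),   # balanced, no discriminative information
        Cluster("d", 1, 4),   # mostly non-pathogenic
        Cluster("e", 19, 1),  # mostly pathogenic
    ]
    path, non_path = select_families(toy)
    print([c.name for c in path], [c.name for c in non_path])
```

Under these assumptions, raising HT or lowering LT shrinks the corresponding set of families, which is the kind of adjustment the caption suggests for unbalanced training-sets.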
Figure 2. PFDB, training and test-set for each model.
Each bar-plot shows the percentage of pathogenic (orange) and non-pathogenic (light-blue) organisms in the training and test-set, and the percentage of pathogenic and non-pathogenic protein families in the PFDB of the model named in the bar-plot title (e.g. WMD). Below each horizontal bar-plot are shown the number of protein families composing the PFDB of the corresponding model, its size in megabytes, and its number of sequences.



Grant support

This work was supported by the Center for Genomic Epidemiology at the Technical University of Denmark and was funded by grant 09-067103/DSF from the Danish Council for Strategic Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.