BMC Genomics. 2009 Aug 19;10:385.
doi: 10.1186/1471-2164-10-385.

Microbial comparative pan-genomics using binomial mixture models

Free PMC article

Lars Snipen et al. BMC Genomics. 2009.

Abstract

Background: The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology.

Results: We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in Buchnera aphidicola to large (around 43000 gene families) in Escherichia coli. Results for Escherichia coli show that as more data become available, a larger diversity is estimated, indicating an extensive pool of rarely occurring genes in the population.

Conclusion: Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find are always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.

Figures

Figure 1
Mixture model example. An illustration of a three component binomial mixture model when G = 10. The upper left panel shows the binomial probability mass function (PMF, red) for the detection probability ρ1 = 1.0, i.e. the core component. In the upper right panel a second component has a binomial PMF (green) where ρ2 = 0.85, and in the lower left panel a third component (blue) has ρ3 = 0.05. The lower right panel shows their combination into 11 multinomial probabilities, using mixing proportions π1 = 0.2, π2 = 0.1 and π3 = 0.7.
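The combination described in the Figure 1 caption can be reproduced numerically. What follows is a minimal Python sketch, not the authors' code: the function names are illustrative, and only G, the detection probabilities ρ and the mixing proportions π are taken from the caption.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability that a gene family with detection probability p
    is found in exactly k out of n genomes."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def mixture_pmf(n_genomes, detection_probs, mixing_props):
    """Mix the component binomial PMFs into one probability per
    possible count k = 0, 1, ..., n_genomes."""
    return [
        sum(
            pi * binomial_pmf(k, n_genomes, rho)
            for pi, rho in zip(mixing_props, detection_probs)
        )
        for k in range(n_genomes + 1)
    ]


G = 10                      # number of genomes (from the caption)
rho = [1.0, 0.85, 0.05]     # detection probabilities; rho = 1.0 is the core component
pi = [0.2, 0.1, 0.7]        # mixing proportions, which sum to 1

probs = mixture_pmf(G, rho, pi)
for k, p in enumerate(probs):
    print(f"P(found in {k:2d} genomes) = {p:.4f}")
```

The list `probs` has G + 1 = 11 entries, matching the 11 multinomial probabilities in the lower right panel. Its first entry is the probability that a gene family is found in none of the genomes, which is the unobserved part of the pan-genome.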
Figure 2
Genomes and their core- and pan-genomes. Number of genomes refers to completed genomes at NCBI [13] at the end of January 2009. Sample core, Median size and Sample pan are the observed quantities, while Mixture core, Chao pan and Mixture pan are estimated quantities. Components is the optimal choice of mixture components. The black bars under Coverage indicate pan-genome coverage, i.e. the current sample pan-genome size as a fraction of the estimated pan-genome size (Mixture pan).
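To make the relation between the observed and estimated quantities concrete, here is a hedged Python sketch. It assumes the usual capture-recapture correction, in which the observed count is divided by the probability of being detected at least once, and it assumes the core size is the mixing proportion of the ρ = 1 component times the estimated pan-genome size. Neither formula is spelled out in the caption, and the input numbers are invented for illustration.

```python
def undetected_probability(n_genomes, detection_probs, mixing_props):
    """Probability that a gene family is found in none of the genomes."""
    return sum(
        pi * (1 - rho) ** n_genomes
        for pi, rho in zip(mixing_props, detection_probs)
    )


def summarize(sample_pan, n_genomes, detection_probs, mixing_props):
    """Return (estimated pan size, estimated core size, coverage).

    Assumption: pan = sample_pan / (1 - P(undetected)), the standard
    capture-recapture correction for families never observed.
    """
    p0 = undetected_probability(n_genomes, detection_probs, mixing_props)
    mixture_pan = sample_pan / (1 - p0)
    # Assumption: the core component is the one with detection probability 1.0.
    core_share = sum(
        pi for pi, rho in zip(mixing_props, detection_probs) if rho == 1.0
    )
    mixture_core = core_share * mixture_pan
    coverage = sample_pan / mixture_pan
    return mixture_pan, mixture_core, coverage


# Hypothetical toy values, not taken from the paper.
pan, core, coverage = summarize(
    sample_pan=700, n_genomes=2,
    detection_probs=[1.0, 0.5], mixing_props=[0.5, 0.5],
)
print(pan, core, coverage)
```

Under these assumptions, coverage is simply one minus the undetected probability, so a species with many rarely occurring gene families will show a short coverage bar.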
Figure 3
Core- and pan-genome size estimates. Observations and estimates of core- and pan-genome sizes. The horizontal axis is on log2 scale. Solid blue markers represent the observed data; squares are the core genes, circles are the median number of genes for an individual genome, and the triangles are the total number of gene families found in the data set. The red "+" represents the estimated core size, whilst the red "x" is the estimated size of the pan-genome using the binomial mixture model. The red "c" is the Chao lower-bound estimate of pan-size. The bars represent a 90% naive bootstrap confidence interval for the pan-genome, giving a rough indication of uncertainty.
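The Chao lower bound mentioned in the caption is commonly written as the observed count plus y1² / (2·y2), where y1 and y2 are the numbers of gene families seen in exactly one and exactly two genomes. The sketch below assumes that standard form, since the caption does not give the formula; the function name and the toy counts are hypothetical.

```python
def chao_lower_bound(genomes_per_family):
    """Chao lower-bound estimate of the total number of gene families.

    genomes_per_family holds, for every observed gene family, the
    number of genomes in which that family was found (always >= 1).
    """
    observed = len(genomes_per_family)
    singletons = sum(1 for c in genomes_per_family if c == 1)
    doubletons = sum(1 for c in genomes_per_family if c == 2)
    if doubletons == 0:
        raise ValueError("estimate is undefined without families seen in two genomes")
    return observed + singletons**2 / (2 * doubletons)


# Toy data: 6 families seen once, 3 seen twice, 4 seen in five genomes.
toy_counts = [1] * 6 + [2] * 3 + [5] * 4
estimate = chao_lower_bound(toy_counts)
print(estimate)
```

Because the correction term depends only on families seen once or twice, the estimate reacts strongly to how rarely occurring genes are annotated.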
Figure 4
Estimated mixture models. Graphical display of binomial mixture models. Each rectangle corresponds to a component, its width indicates its mixing proportion and its color indicates its detection probability (see color bar). Red areas indicate parts of the pan-genome with a small detection probability, i.e. rarely occurring genes, whilst regions towards the blue end of the scale represent conserved genes – that is, genes shared by most of the genomes.
Figure 5
Effect of growing E. coli data set. Sample (black) and estimated population (red and blue) pan-genome sizes for E. coli, as a function of the number of genomes sampled. Our mixture-model estimate is shown in blue, the Chao lower-bound estimate in red and the observed size in black. All of these values are averages over 22 data sets. Note that for the lower numbers of genomes, the estimates tend to have larger variability, due to the larger number of ways to sample a small number of genomes out of a pool of 22 genomes; at the other end of the scale, the 22 possible combinations of 21 genomes are very similar to each other.
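The variability argument in the Figure 5 caption rests on how many distinct subsets of a given size can be drawn from 22 genomes. The Python sketch below counts those subsets and, for a small invented data set, averages the observed pan-genome size over all subsets of each size. The toy genomes and the function name are hypothetical, and exhaustive enumeration is used here only because the example is tiny.

```python
from itertools import combinations
from math import comb

# Number of distinct ways to choose k genomes out of a pool of 22.
subset_counts = {k: comb(22, k) for k in (2, 11, 21)}
print(subset_counts)


def mean_observed_pan_size(genomes, k):
    """Average number of distinct gene families over all subsets of k genomes.

    genomes is a list of sets, one set of gene-family identifiers per genome.
    """
    sizes = [len(set().union(*subset)) for subset in combinations(genomes, k)]
    return sum(sizes) / len(sizes)


# Hypothetical toy genomes: family "a" is core, the others occur less often.
toy_genomes = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "e"},
    {"a", "b", "c", "f"},
]
for k in range(1, len(toy_genomes) + 1):
    print(k, mean_observed_pan_size(toy_genomes, k))
```

With 22 genomes, subsets of intermediate size are far too numerous to enumerate, so in practice one would average over a random selection of them.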

References

    1. Read TD, Ussery DW. Opening the pan-genomics box. Current Opinion in Microbiology. 2006;9. doi: 10.1016/j.mib.2006.08.010.
    2. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AJ, Durkin AS, DeBoy RT, Davidsen TM, Mora M, Scarselli M, y Ros IM, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJB, Smith S, Utterback TR, White O, Rubens EC, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial pan-genome. PNAS. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
    3. Chen S, Hung C, Xu J, Reigstad C, Magrini V, Sabo A, Blasiar D, Bieri T, Meyer R, Ozersky P, Armstrong J, Fulton R, Latreille J, Spieth J, Hooton T, Merdis E, Hultgren S, Gordon J. Identification of genes subject to positive selection in uropathogenic strains of Escherichia coli: A comparative genomics approach. PNAS. 2006;103(15):5977–5982. doi: 10.1073/pnas.0600938103.
    4. Rasko D, Rosovitz MJ, Myers G, Mongodin E, Fricke W, Gajer P, Crabtree J, Sebaihia M, Thomson N, Chaudhuri R, Henderson I, Sperandio V, Ravel J. The Pangenome Structure of Escherichia coli: Comparative Genomic Analysis of E. coli Commensal and Pathogenic Isolates. Journal of Bacteriology. 2008;190(20):6881–6893. doi: 10.1128/JB.00619-08.
    5. Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray. Genome Biology. 2007;8. doi: 10.1186/gb-2007-8-12-r267.