PLoS One. 2012;7(10):e47656.
doi: 10.1371/journal.pone.0047656. Epub 2012 Oct 17.

MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit

Free PMC article

Jens Roat Kultima et al. PLoS One. 2012.

Abstract

MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
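
As a rough illustration of the abundance calculations mentioned in the abstract, the sketch below derives relative abundances from read counts and from base counts. The input layout (pairs of reference ID and aligned length) and the function name `relative_abundances` are hypothetical and chosen for illustration; they are not MOCAT's own file format or code.

```python
from collections import defaultdict


def relative_abundances(mapped_reads):
    """Return (by_reads, by_bases) relative abundances per reference.

    mapped_reads: iterable of (reference_id, aligned_length_bp) pairs.
    This input layout is an assumption made for the example.
    """
    read_counts = defaultdict(int)
    base_counts = defaultdict(int)
    for reference_id, aligned_length in mapped_reads:
        read_counts[reference_id] += 1               # one mapped read
        base_counts[reference_id] += aligned_length  # bases it covers

    total_reads = sum(read_counts.values())
    total_bases = sum(base_counts.values())
    if total_reads == 0 or total_bases == 0:
        raise ValueError("no mapped reads or bases to normalize")

    by_reads = {ref: n / total_reads for ref, n in read_counts.items()}
    by_bases = {ref: n / total_bases for ref, n in base_counts.items()}
    return by_reads, by_bases


# Made-up example data: two references, reads of different lengths.
example = [("genome_A", 100), ("genome_A", 100),
           ("genome_B", 50), ("genome_B", 50)]
by_reads, by_bases = relative_abundances(example)
print(by_reads)
print(by_bases)
```

The two normalizations can differ when read lengths vary after trimming, which is one reason base counts and read counts may be reported separately.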

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The MOCAT data processing pipeline.
Metagenomic samples are collected and sequenced. The raw sequence reads are given as input to the pipeline and processed in modular steps, resulting in metagenome assemblies and predicted genes. Arrows extending to the right from boxes indicate input to various downstream analyses. Statistics from each step are summarized into multi-sheet Excel documents, as well as queryable SQLite databases.
Figure 2. Relative abundance of each reference genome present in the simulated metagenome.
The observed abundances, obtained by mapping reads to reference genomes, and the expected abundances correlate with a Pearson correlation coefficient of 0.95 (for both base and read counts). Circles represent genomes with multiple strains from one species, and squares represent genomes with only one strain within the species. All but one of the observations deviating from the diagonal are strains from the same species. These strains are either over- or underrepresented because reads are mapped to other closely related strains in addition to the strain of origin. Dashed lines highlight two examples where high sequence similarity between strains (99.9% and 98.7% for the Synechococcus elongatus and Escherichia coli strains, respectively) can result in deviations from expected abundances.
Figure 3. Relative abundance of each genus present in the even HMP mock community.
The abundances estimated by qPCR and by mapping reads to reference genomes correlate with a Pearson correlation coefficient of 0.75 (base counts) and 0.83 (read counts).
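
The Pearson correlation coefficient reported for Figures 2 and 3 can be computed as in the minimal sketch below. The function name `pearson` and the abundance values are made up for illustration; they are not data from the paper or code from MOCAT.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical relative abundances for four references.
expected = [0.40, 0.30, 0.20, 0.10]
observed = [0.35, 0.33, 0.22, 0.10]
print(round(pearson(expected, observed), 3))
```

A coefficient near 1 means the observed abundances rise and fall with the expected ones; it does not by itself show that the two agree in absolute value.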

Grant support

This work was funded by EMBL, the European Community’s Seventh Framework Programme via the MetaHIT (HEALTH-F4-2007-201052), International Human Microbiome Standards (IHMS) (HEALTH-F4-2010-261376), and Cancerbiome (ERC Advanced Grant 268985) grants. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
