PLoS One. 2012;7(6):e29044.
doi: 10.1371/journal.pone.0029044. Epub 2012 Jun 13.

A Case Study for Large-Scale Human Microbiome Analysis Using JCVI's Metagenomics Reports (METAREP)

Free PMC article

Johannes Goll et al. PLoS One. 2012.


As metagenomic studies continue to increase in number, sequence volume, and complexity, the scalability of biological analysis frameworks has become a rate-limiting factor to meaningful data interpretation. To address this issue, we have developed JCVI Metagenomics Reports (METAREP) as an open source tool to query, browse, and compare extremely large volumes of metagenomic annotations. Here we present improvements to this software including the implementation of a dynamic weighting of taxonomic and functional annotation, support for distributed searches, advanced clustering routines, and integration of additional annotation input formats. The utility of these improvements to data interpretation is demonstrated through the application of multiple comparative analysis strategies to shotgun metagenomic data produced by the National Institutes of Health Roadmap for Biomedical Research Human Microbiome Project (HMP). Specifically, the scalability of the dynamic weighting feature is evaluated and established by its application to the analysis of over 400 million weighted gene annotations derived from 14 billion short reads as predicted by the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline. Further, the capacity of METAREP to facilitate the identification and simultaneous comparison of taxonomic and functional annotations including biological pathway and individual enzyme abundances from hundreds of community samples is demonstrated by providing scenarios that describe how these data can be mined to answer biological questions related to the human microbiome. These strategies provide users with a reference of how to conduct similar large-scale metagenomic analyses using METAREP with their own sequence data, while in this study they reveal insights into the nature and extent of variation in taxonomic and functional profiles across body habitats and individuals. Over one thousand HMP WGS datasets and the latest open source code are available at

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.


Figure 1. Screenshot of the METAREP Compare Page.
The Compare page allows users to filter, compare and visualize annotation attributes across multiple datasets. As illustrated in the upper panel, the user can find and select datasets of interest (here pooled body habitats were selected). The middle panel illustrates filter and compare options (here datasets were filtered for the pyruvate dehydrogenase complex and the heatmap plot option was selected). The bottom panel shows the compare results and allows users to switch between annotation attributes and specify their level of granularity (here the taxonomy attribute and phylum level were selected).
Figure 2. Heatmap plots of three enzymatic markers.
Marker abundance is contrasted across phyla (columns) and body habitats (rows) using Morisita-Horn distances in combination with the average linkage clustering method. Colors encode the relative abundance of the selected feature-dataset combination (dark red 0% to white 100%) while the dendrograms at the top and left show annotation feature and dataset differences, respectively.
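
The Morisita-Horn distance named in this caption can be illustrated with a minimal sketch. It uses the standard textbook form of the index and is not taken from the METAREP source code; the function names and the example counts are hypothetical.

```python
def morisita_horn_similarity(x, y):
    """Morisita-Horn similarity between two abundance vectors.

    x and y hold counts for the same features (e.g. phyla) in the same order.
    """
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total_x = sum(x)
    total_y = sum(y)
    if total_x == 0 or total_y == 0:
        raise ValueError("each sample needs at least one non-zero count")
    # Simpson-type dominance term of each sample.
    d_x = sum(v * v for v in x) / (total_x ** 2)
    d_y = sum(v * v for v in y) / (total_y ** 2)
    # Shared abundance between the two samples.
    cross = sum(a * b for a, b in zip(x, y))
    return 2.0 * cross / ((d_x + d_y) * total_x * total_y)


def morisita_horn_distance(x, y):
    """Distance form used for clustering: 1 - similarity."""
    return 1.0 - morisita_horn_similarity(x, y)


if __name__ == "__main__":
    # Hypothetical phylum counts for two habitats (not data from the study).
    habitat_a = [40, 30, 30]
    habitat_b = [10, 60, 30]
    print(round(morisita_horn_distance(habitat_a, habitat_b), 3))
```

Because the index works on proportions, two samples with the same relative composition but different sequencing depth have a distance of zero, which is one reason it suits datasets of unequal size.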
Figure 3. Hierarchical cluster plots of 48 samples taken from 12 females and 12 males at two different time points.
Hierarchical clustering analysis of a random subset of human microbiome samples taken from five human body regions clustered by NCBI taxonomy at the family level (a) and by KEGG pathways (b). Clusters were generated by the average linkage clustering method using the Morisita-Horn index to generate a distance matrix (shown on the x-axis). Dataset labels encode the following information: [donor ID]-[habitat]-[gender]-[time point]-[sample ID]-[annotation-type]. For example, the dataset label 159814214-an-m-2-SRS047225-mtr encodes a sample from a male donor (ID 159814214) taken from the anterior nares site at time point 2 with sample ID (SRS047225) annotated by the metabolic reconstruction (HUMAnN) pipeline (mtr). The dotted line represents the level at which the tree was cut for analysis. The resulting clusters are labeled as follows: AN (anterior nares), BM (buccal mucosa), SP (supragingival plaque), ST (stool), and PF (posterior fornix).
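
The dataset label format described in this caption can be split into its fields with a short sketch. The `DatasetLabel` type and `parse_dataset_label` function are hypothetical names, not part of METAREP, and the code assumes that no field itself contains a hyphen. Only the codes `an`, `m`, and `mtr` are confirmed by the caption, so the fields are returned as raw strings rather than mapped to full names.

```python
from typing import NamedTuple


class DatasetLabel(NamedTuple):
    donor_id: str
    habitat: str
    gender: str
    time_point: int
    sample_id: str
    annotation_type: str


def parse_dataset_label(label: str) -> DatasetLabel:
    """Split a label of the form
    [donor ID]-[habitat]-[gender]-[time point]-[sample ID]-[annotation-type].
    """
    parts = label.split("-")
    if len(parts) != 6:
        raise ValueError(
            f"expected 6 hyphen-separated fields, got {len(parts)}: {label!r}"
        )
    donor_id, habitat, gender, time_point, sample_id, annotation_type = parts
    # The time point is numeric in the caption's example, so convert it.
    return DatasetLabel(
        donor_id, habitat, gender, int(time_point), sample_id, annotation_type
    )


if __name__ == "__main__":
    # The example label given in the caption.
    print(parse_dataset_label("159814214-an-m-2-SRS047225-mtr"))
```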
Figure 4. Screenshots of METAREP statistical result panels.
List of phyla and pathways that are differentially abundant between the buccal mucosa (n = 116) and supragingival plaque (n = 89) habitats. Taxonomic differences reported by Metastats with confidence intervals (formula image) are shown in (a); differences in KEGG pathway abundance detected by the Wilcoxon rank-sum test are shown in (b).
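
The Wilcoxon rank-sum test mentioned in this caption can be sketched as follows. This is a generic illustration using the normal approximation with a tie correction and no continuity correction; it is not the implementation METAREP uses, which this page does not describe. The function name and the example values are hypothetical.

```python
import math


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for sample a, z score, two-sided p-value).
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool the values, remembering which sample each came from.
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at 1-based ranks i+1..j all receive the midrank.
        midrank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        # All values are identical, so there is no evidence of a difference.
        return u, 0.0, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return u, z, p


if __name__ == "__main__":
    # Hypothetical relative abundances of one pathway in two habitats.
    habitat_1 = [0.12, 0.15, 0.11, 0.14, 0.13]
    habitat_2 = [0.21, 0.19, 0.25, 0.22, 0.18]
    print(rank_sum_test(habitat_1, habitat_2))
```

The normal approximation is reasonable for sample sizes like those in the caption (116 and 89); for very small samples an exact test would be the safer choice.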
Figure 5. Software architecture overview.
The METAREP software integrates several open source tools to import, store and analyze metagenomics annotations. Users can analyze stored data using a variety of web-based tools. A subset of the web functionality is available via a programmatic access module which allows data retrieval directly from the MySQL database and Lucene index files.
Figure 6. Comparison of query response time for two weighted search approaches.
Each data point marks the query response time (y axis) for a query that returned x number of entries (x axis). The blue line indicates the linear fit for the weighted search approach while the red line indicates the linear fit for the distributed weighted search approach. Parameter estimations for the linear regression models are given in the boxes above the fitted lines.
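
The linear fits described in this caption relate query response time to the number of returned entries. A minimal ordinary least-squares sketch is shown below. The function name is hypothetical and the timing values are invented for illustration only; they are not measurements from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented example: number of returned entries vs. response time in seconds.
    entries = [1_000_000, 2_000_000, 4_000_000, 8_000_000]
    seconds = [0.9, 1.7, 3.4, 6.6]
    slope, intercept = fit_line(entries, seconds)
    print(f"seconds per million entries: {slope * 1e6:.3f}")
    print(f"fixed overhead (seconds): {intercept:.3f}")
```

Under a fit of this kind, the slope estimates the cost per returned entry and the intercept estimates fixed per-query overhead, which is how two search approaches could be compared.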
