PLoS Comput Biol. 2015 May 7;11(5):e1004226.
doi: 10.1371/journal.pcbi.1004226. eCollection 2015 May.

Sparse and Compositionally Robust Inference of Microbial Ecological Networks

Free PMC article

Zachary D Kurtz et al. PLoS Comput Biol.


16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions (from metabolic and immunological health in mammals to ecological stability in soils and oceans), identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from amplicon-based datasets are compositional. Counts are normalized to the total number of counts in the sample. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation) for the detection of OTU-OTU relationships can lead to spurious results. Secondly, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU association networks is severely under-powered, and additional information (or assumptions) is required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological networks from amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark in the absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a set of computational tools to generate OTU count data from a set of diverse underlying network topologies.
SPIEC-EASI outperforms state-of-the-art methods to recover edges and network properties on synthetic data under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbial associations using data from the American Gut project.

Conflict of interest statement

The authors have declared that no competing interests exist.


Fig 1. Conditional independence vs correlation analysis for a toy dataset.
In an ecosystem, the abundance of any OTU is potentially dependent on the abundances of other OTUs in the ecological network. Here, we simulate abundances from a network where OTU 3 directly influences (via some set of biological mechanisms) the abundances of OTUs 1, 2 and 4 (a). The inference goal is to recover the underlying network from the simulated data. b) Absolute abundances of these four OTUs were drawn from a negative-binomial distribution across 500 samples according to the true network (as described in the Methods section). c) Computing all pairwise Pearson correlations yields a symmetric matrix showing patterns of association (positive correlations are green and negative are red). We thresholded entries of the correlation matrix to generate relevance networks. d) A threshold at ∣ρ∣ ≥ 0.35 (represented by dashed and solid edges) results in a network in which OTU 3 is connected to all other OTUs, with an additional connection between OTU 2 and OTU 4. A more stringent threshold at ∣ρ∣ ≥ 0.5 results in a sparser relevance network (notably missing the edge between OTU 3 and OTU 1), represented in d by solid edges only. Importantly, no single threshold recovers the true underlying hub topology. e) The inverse sample covariance matrix yields a symmetric matrix where entries are approximately zero if the corresponding OTU pairs are conditionally independent. The network (f) inferred from the non-zero entries (colored in blue in e) identifies the correct hub network. Thus, it is possible to choose a threshold for the sample inverse covariance that faithfully recovers the true network. Such a threshold is not guaranteed to exist for correlation or covariance (the metric used by SparCC and CCREPE). Intuitively, this is because simultaneous direct connections can induce strong correlations between nodes that do not have direct relationships (e.g. OTU 2-4). Conversely, weak correlations can arise between directly connected nodes (e.g. OTU 1-3).
Although correlation is a useful measure of association in many contexts, it is a pairwise metric and therefore limited in a multivariate setting. In contrast, SPIEC-EASI’s estimates of entries in the inverse covariance matrix depend on the conditional states of all available nodes. This feature helps SPIEC-EASI avoid detecting indirect network interactions.
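The contrast drawn in this caption can be reproduced with a minimal sketch. It is not the authors' simulation: it assumes Gaussian abundances instead of negative-binomial counts, and the coupling strengths (0.5 and 1.5), the sample size, and both thresholds are illustrative choices. Partial correlations are obtained by rescaling the inverse sample covariance matrix.

```python
import random
from math import sqrt

def covariance(data):
    """Sample covariance matrix of a list of equal-length rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(p)] for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]

def normalize(matrix, sign=1.0):
    """Rescale to unit diagonal; sign=-1 turns a precision matrix into partial correlations."""
    p = len(matrix)
    return [[1.0 if i == j else sign * matrix[i][j] / sqrt(matrix[i][i] * matrix[j][j])
             for j in range(p)] for i in range(p)]

def edges_above(matrix, threshold):
    """Pairs (1-based OTU labels) whose absolute entry exceeds the threshold."""
    p = len(matrix)
    return {(i + 1, j + 1) for i in range(p) for j in range(i + 1, p)
            if abs(matrix[i][j]) > threshold}

# Hypothetical hub network: OTU 3 drives OTUs 1, 2 and 4 (weakly for OTU 1).
random.seed(0)
samples = []
for _ in range(5000):
    hub = random.gauss(0, 1)
    otu1 = 0.5 * hub + random.gauss(0, 1)
    otu2 = 1.5 * hub + random.gauss(0, 1)
    otu4 = 1.5 * hub + random.gauss(0, 1)
    samples.append([otu1, otu2, hub, otu4])

cov = covariance(samples)
corr = normalize(cov)
pcor = normalize(invert(cov), sign=-1.0)

corr_edges = edges_above(corr, 0.5)   # relevance network from correlations
pcor_edges = edges_above(pcor, 0.1)   # network from the inverse covariance
print("correlation edges:", sorted(corr_edges))
print("partial-correlation edges:", sorted(pcor_edges))
```

In this setup the indirect pair OTU 2-4 is more strongly correlated than the direct pair OTU 1-3, so the correlation threshold keeps the spurious edge and drops the true one, while the inverse covariance separates direct from indirect pairs.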
Fig 2. Workflow of the SPIEC-EASI pipeline.
The SPIEC-EASI pipeline consists of two independent parts for a) synthetic data generation and b) network inference. a) Synthetic data generation requires an OTU count table and a user-selected network topology. Internally, the parameters of a statistical distribution (the zero-inflated negative binomial model is suggested) are fit to the OTU marginals of the real data, and are combined with the randomly generated network in the Normal to Anything (NorTA) approach to generate correlated count data. b) Network inference proceeds in three stages on synthetic or real OTU count data. First, data are pre-processed and centered log-ratio (CLR) transformed to ensure compositional robustness. Next, the user selects one of two graphical model inference procedures: 1) neighborhood selection (the MB method) or 2) inverse covariance selection (the glasso method). SPIEC-EASI network inference assumes that the underlying network is sparse. Finally, we infer the correct model sparseness by the Stability Approach to Regularization Selection (StARS), which involves random subsampling of the dataset to find a network with low variability in the selected set of edges. SPIEC-EASI outputs include an ecological network (from the non-zero entries of the inverse covariance matrix) and an invertible covariance matrix. If the network was inferred from synthetic data, it can be compared with the input network to assess inference quality.
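The CLR step named above can be sketched in a few lines. This is a minimal illustration, not the SPIEC-EASI code: the function name `clr` and the `pseudocount` parameter (added so that zero counts have a finite logarithm) are assumptions, since the text here does not say how zeros are handled.

```python
from math import log

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's OTU counts.

    Each log count is centered on the sample's mean log count, which equals
    the log of the count divided by the sample's geometric mean.
    """
    logs = [log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical sample with one unobserved OTU.
sample = [120, 30, 0, 850]
transformed = clr(sample)

# Without a pseudocount, CLR does not depend on sequencing depth:
# raw counts and relative abundances give the same transformed values.
positive = [120, 30, 5, 850]
total = sum(positive)
from_counts = clr(positive, pseudocount=0.0)
from_fractions = clr([c / total for c in positive], pseudocount=0.0)
```

Because the transformed values of a sample sum to zero, the normalization to total counts no longer changes the input to the downstream graphical model.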
Fig 3. a) Bivariate illustration of the NorTA approach.
First, normal data incorporating the target correlation structure are generated. Uniform data are then generated for each margin via the normal cumulative distribution function. These are then converted to an arbitrary marginal distribution (Poisson and zero-inflated negative binomial shown as examples) via its quantile function. To generate realistic synthetic data, parameters for these margins are fit to real data. b) Examples of band-like, cluster, and scale-free network topologies.
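The three NorTA steps in panel a can be sketched for the bivariate Poisson case. This is an illustration under stated assumptions rather than the authors' generator: the helper names, the target normal correlation of 0.8, and the Poisson means of 5 and 20 are all hypothetical, and the margins are not fit to real data.

```python
import random
from math import exp, sqrt
from statistics import NormalDist

def poisson_quantile(u, lam, max_k=1000):
    """Smallest k whose Poisson(lam) cumulative probability reaches u."""
    k = 0
    pmf = exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:   # max_k guards against u rounding to 1.0
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k

def norta_pair(rho, lam_a, lam_b, rng):
    """Draw one correlated pair of Poisson counts via normal -> uniform -> Poisson."""
    # Step 1: bivariate standard normal with correlation rho.
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    # Step 2: the normal CDF maps each margin to a uniform on (0, 1).
    std_normal = NormalDist()
    u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
    # Step 3: the target quantile function maps each uniform to a count.
    return poisson_quantile(u1, lam_a), poisson_quantile(u2, lam_b)

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
pairs = [norta_pair(0.8, 5.0, 20.0, rng) for _ in range(2000)]
counts_a = [a for a, _ in pairs]
counts_b = [b for _, b in pairs]
print("mean of first margin:", sum(counts_a) / len(counts_a))
print("correlation of counts:", pearson(counts_a, counts_b))
```

The correlation of the resulting counts is generally somewhat lower than the correlation imposed on the normal data, because the quantile transform is nonlinear and the margins are discrete.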
Fig 4. Precision-recall performance on synthetic datasets.
a) Red = S-E(glasso), orange = S-E(MB), purple = SparCC, blue = CCREPE, green = Pearson correlation, black = random. Area under the precision-recall curve (AUPR) vs. number of samples n is depicted for different κ values. Bars represent averages over 20 synthetic datasets, and error bars represent standard error. Asterisks denote conditions under which SPIEC-EASI methods had significantly higher AUPR than all control methods (P<0.05 for all one-sided t-tests). b) Representative precision-recall curves for p = 68, n = 102, κ = 100; solid and dashed lines denote SPIEC-EASI and control methods, respectively.
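As a rough illustration of how edge recovery can be scored, the sketch below ranks candidate edges by a confidence score and summarizes the precision-recall curve by average precision. That summary is one common convention for the area under the curve; the caption does not state which interpolation the authors used, and the scores and edge sets here are made up.

```python
def precision_recall(scores, truth):
    """Return (recall, precision) after each ranked edge, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    curve, hits = [], 0
    for rank, edge in enumerate(ranked, start=1):
        hits += edge in truth
        curve.append((hits / len(truth), hits / rank))
    return curve

def average_precision(scores, truth):
    """Mean of the precision values at the rank of each recovered true edge."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, total = 0, 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in truth:
            hits += 1
            total += hits / rank
    return total / len(truth)

# Hypothetical hub network and edge scores from some inference method.
true_edges = {(1, 3), (2, 3), (3, 4)}
edge_scores = {
    (2, 3): 0.90, (3, 4): 0.80, (2, 4): 0.70,
    (1, 3): 0.40, (1, 2): 0.10, (1, 4): 0.05,
}

curve = precision_recall(edge_scores, true_edges)
aupr = average_precision(edge_scores, true_edges)
print("average precision:", aupr)
```

Here the spurious edge (2, 4) outranks the true edge (1, 3), so the last true edge is recovered at a precision of 3/4 and the summary falls below 1.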
Fig 5. a) Predicted degree distributions (colored) are overlaid with the true degree distribution (white) for n = 1360 samples, p = 205 OTUs, κ = 100.
Lighter shades correspond to regions of overlap between predicted and true distributions. Dissimilarity between the distributions is measured by KL divergence, D_KL. b) Bars represent the average D_KL over three independent sets of synthetic datasets (7 datasets per set); error bars represent standard error. Divergences were compared between S-E and control methods using one-sided t-tests; ***, **, * correspond to P<0.001, 0.01, and 0.05.
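The comparison of degree distributions can be sketched as follows. Two details are assumptions, because the caption does not specify them: the divergence is taken from the true distribution to the predicted one, and a small additive constant `eps` smooths empty degree bins so that the logarithm stays finite. The two example networks are hypothetical.

```python
from collections import Counter
from math import log

def degree_distribution(edges, n_nodes):
    """Fraction of nodes (labelled 0..n_nodes-1) having each degree 0..n_nodes-1."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    counts = [0] * n_nodes
    for node in range(n_nodes):
        counts[degree[node]] += 1
    return [c / n_nodes for c in counts]

def kl_divergence(p, q, eps=1e-6):
    """D_KL(p || q) in nats after additive smoothing and renormalization."""
    p_s = [x + eps for x in p]
    q_s = [x + eps for x in q]
    p_tot, q_tot = sum(p_s), sum(q_s)
    return sum((a / p_tot) * log((a / p_tot) / (b / q_tot))
               for a, b in zip(p_s, q_s))

# Hypothetical true hub network and a chain-shaped prediction on four nodes.
true_edges = [(0, 2), (1, 2), (2, 3)]
predicted_edges = [(0, 1), (1, 2), (2, 3)]

true_dist = degree_distribution(true_edges, 4)
pred_dist = degree_distribution(predicted_edges, 4)
d_kl = kl_divergence(true_dist, pred_dist)
print("true:", true_dist, "predicted:", pred_dist, "D_KL:", d_kl)
```

The value depends strongly on `eps` whenever the prediction assigns no mass to a degree that occurs in the true network, so comparisons are only meaningful at a fixed smoothing constant.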
Fig 6. a) Network reproducibility for inference methods (see main text for details).
Bars represent mean Hamming distance, and error bars are 95% confidence intervals. b) Visualization of edge overlap between networks inferred with SPIEC-EASI, SparCC, and CCREPE. c) Network visualizations in which OTU nodes are colored by Family lineage (or Order, when the Family of the OTU is unknown), edges are colored by sign (positive: green, negative: red), and node diameter is proportional to the geometric mean of that OTU’s relative abundance.
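The reproducibility measure in panel a can be illustrated with a small sketch. The caption defers the details to the main text, so this assumes the usual definition for undirected graphs: the Hamming distance between two networks is the number of edges present in exactly one of them. The three example networks stand in for networks inferred from different subsets of samples and are hypothetical.

```python
from itertools import combinations

def hamming_distance(edges_a, edges_b):
    """Number of undirected edges present in exactly one of the two networks."""
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    return len(a ^ b)

def mean_pairwise_hamming(networks):
    """Average Hamming distance over all pairs of networks."""
    pairs = list(combinations(networks, 2))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical networks inferred from three subsamples of the same dataset.
net_a = [(1, 3), (2, 3), (3, 4)]
net_b = [(3, 1), (2, 3), (2, 4)]   # (3, 1) is the same undirected edge as (1, 3)
net_c = [(1, 3), (2, 3), (3, 4)]

distance_ab = hamming_distance(net_a, net_b)
mean_distance = mean_pairwise_hamming([net_a, net_b, net_c])
print("d(a, b):", distance_ab, "mean pairwise:", mean_distance)
```

A lower mean distance indicates that the method selects a more stable edge set across subsamples.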

