Inferring correlation networks from genomic survey data
- PMID: 23028285
- PMCID: PMC3447976
- DOI: 10.1371/journal.pcbi.1002687
Abstract
High-throughput sequencing based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities - be they from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that the analysis of the type of data generated by such techniques (referred to as compositional data) can produce unreliable results since the observed data take the form of relative fractions of genes or species, rather than their absolute abundances. Using simulated and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with opposite sign. Additionally, we show that community diversity is the key factor that modulates the acuteness of such compositional effects, and develop a new approach, called SparCC (available at https://bitbucket.org/yonatanf/sparcc), which is capable of estimating correlation values from compositional data. To illustrate a potential application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimated that the standard approach yields 3 spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity.
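The compositional effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the SparCC method or the authors' simulation: the three-taxon community, its log-normal parameters, and all variable names are assumptions chosen for illustration. The absolute abundances are drawn independently, so any correlation that appears among the fractions comes from the normalization alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical community: 3 taxa with independent absolute abundances.
# Taxon 0 dominates, i.e. the community has low diversity.
n_samples, n_taxa = 2000, 3
absolute = rng.lognormal(mean=[5.0, 3.0, 3.0], sigma=0.5,
                         size=(n_samples, n_taxa))

# A sequencing survey reports only relative fractions:
# each sample (row) is rescaled to sum to 1.
fractions = absolute / absolute.sum(axis=1, keepdims=True)

# Pearson correlations between taxa (columns), before and after normalization.
corr_abs = np.corrcoef(absolute, rowvar=False)
corr_frac = np.corrcoef(fractions, rowvar=False)

print("absolute abundances:\n", np.round(corr_abs, 2))
print("relative fractions:\n", np.round(corr_frac, 2))
```

In this setup the off-diagonal entries of `corr_abs` stay close to zero, while the fractions of the dominant taxon and of a minor taxon come out strongly negatively correlated, even though no dependency was simulated.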
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
, used in the simulations and observed in the HMP data are indicated on the left. As in Fig. 1, nodes represent OTUs, with size reflecting the OTU's average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn.
, and community diversity, as given by the Shannon entropy effective number of components. SparCC errors are smaller than Pearson errors for all parameter values. For the maximal diversity plotted, 50 effective OTUs, the inference error obtained using Pearson correlations is greatly decreased. Therefore, it is likely that Pearson correlations perform well on gene expression data, where the effective number of genes is typically in the hundreds or thousands. For each combination of density and diversity, multiple basis correlation networks were randomly generated, and corresponding data were sampled and used for correlation estimation. Dots labeled mid-vagina and gut indicate the average diversity observed in the mid-vagina and gut communities, and the density of their estimated correlation networks. Dots labeled 2D–I indicate the diversity and density used to generate the communities analyzed in Fig. 2.
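The diversity measure named in the caption above can be sketched as follows. The formula used here, the exponential of the Shannon entropy of the average community fractions, is the standard definition of an entropy-based effective number and is assumed rather than quoted from this page; the function name `effective_number` is hypothetical.

```python
import numpy as np

def effective_number(fractions):
    """Shannon-entropy effective number of components, exp(H).

    Assumed definition: H = -sum_i p_i * ln(p_i), where p_i is the
    fraction of component i. Zero fractions contribute nothing to H.
    """
    p = np.asarray(fractions, dtype=float)
    p = p / p.sum()      # renormalize in case the input does not sum to 1
    p = p[p > 0]         # drop zeros to avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))

# An even community of 4 OTUs versus one dominated by a single OTU.
even = effective_number([0.25, 0.25, 0.25, 0.25])
skewed = effective_number([0.97, 0.01, 0.01, 0.01])
print(even, skewed)
```

An even community has an effective number equal to its OTU count, while a community dominated by one OTU has an effective number close to 1, which is the low-diversity regime where the caption reports the largest Pearson errors.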
