Bioinformatics. 2018 Feb 1;34(3):372-380.
doi: 10.1093/bioinformatics/btx549.

Temporal probabilistic modeling of bacterial compositions derived from 16S rRNA sequencing

Tarmo Äijö et al. Bioinformatics.

Abstract

Motivation: The number of microbial and metagenomic studies has increased drastically due to advancements in next-generation sequencing-based measurement techniques. Statistical analysis of (time series) 16S rRNA and other metagenomic sequencing data, and the validity of the conclusions drawn from it, are hampered by a significant amount of noise and missing data (sampling zeros). Accounting for uncertainty in microbiome data is often challenging because biological replicates are difficult to obtain. Additionally, the compositional nature of current amplicon and metagenomic data differs from many other biological data types, which adds another challenge to the data analysis.

Results: To address these challenges in human microbiome research, we introduce a novel probabilistic approach to explicitly model overdispersion and sampling zeros by considering the temporal correlation between nearby time points using Gaussian Processes. The proposed Temporal Gaussian Process Model for Compositional Data Analysis (TGP-CODA) shows superior modeling performance compared to commonly used Dirichlet-multinomial, multinomial and non-parametric regression models on real and synthetic data. We demonstrate that the nonreplicative nature of human gut microbiota studies can be partially overcome by our method with proper experimental design of dense temporal sampling. We also show that different modeling approaches have a strong impact on ecological interpretation of the data, such as stationarity, persistence and environmental noise models.

Availability and implementation: A Stan implementation of the proposed method is available under MIT license at https://github.com/tare/GPMicrobiome.

Contact: taijo@flatironinstitute.org or rb113@nyu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Statistical model and prior distributions. A graphical representation of our model. Grey and white circles depict observed variables and latent variables, respectively. Grey squares represent user-definable parameters. The Gaussian processes, G, model noise-free real-valued ‘compositions’ (log odds ratios), which are used as a basis for generating noisy real-valued ‘compositions’ (log odds ratios), F. Noisy compositions, Θ, are obtained from F by applying the softmax transformation. Zero-inflation-aware compositions, Θzi, are obtained from Θ and β by Θzi=Φ(Θ;β) [Equation (13)]. The likelihood of data is evaluated using the zero-inflation-aware composition parameters, Θzi. Underlying unobservable noise-free compositions, ΘG, are obtained from G by applying the softmax transformation
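The softmax step named in the caption of Fig. 1 can be sketched as follows. This is a minimal illustration, not the authors' Stan implementation: the function name `softmax`, the example vector `f_t` and the variable `theta_t` are assumptions made here, and the caption does not state the exact parameterization (for example, whether one taxon is fixed as a reference).

```python
import math


def softmax(scores):
    """Map real-valued scores (one per taxon) to proportions that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical noisy real-valued 'composition' F at a single time point.
f_t = [0.0, 1.0, -2.0]

# Corresponding simplex-valued composition (Theta in the caption).
theta_t = softmax(f_t)
```

Applying the same function to a noise-free vector would, under the same assumptions, give the analogue of ΘG.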
Fig. 2.
Temporal correlation in composition estimation. (a) Box plots illustrate estimation errors of our temporal TGP-CODA and DM models. 6, 9, 14 and 27 time points with 36 taxa are considered. Estimation error is defined to be the Euclidean distance between the first M – 1 components of the simplex-valued proportion vectors. (b) Box plots illustrate the estimation error of the temporal and DM models at the time points with induced sampling zeros. The cases of 10, 20, 40 and 100 sampling zeros with 14 time points and 36 taxa are considered. Estimation error is defined to be the Euclidean distance between the first M – 1 components of the simplex-valued proportion vectors. Each box plot is calculated from 100 simulations. Outliers are not depicted. The two-sided p-values from the Wilcoxon signed-rank tests are listed
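The estimation error defined in the caption of Fig. 2 can be written out as a short sketch. The function name `estimation_error` and the example vectors are hypothetical; only the definition (Euclidean distance over the first M – 1 components) comes from the caption.

```python
import math


def estimation_error(theta_true, theta_est):
    """Euclidean distance between the first M - 1 components of two
    proportion vectors of length M."""
    if len(theta_true) != len(theta_est):
        raise ValueError("proportion vectors must have the same length")
    m = len(theta_true)
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(theta_true[: m - 1], theta_est[: m - 1]))
    )


# Hypothetical true and estimated compositions for M = 3 taxa.
true_props = [0.5, 0.3, 0.2]
est_props = [0.4, 0.3, 0.3]
err = estimation_error(true_props, est_props)
```

The last component is dropped because it is determined by the other M – 1 components once the vector is constrained to sum to one.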
Fig. 3.
Effect of sampling frequency on the estimation of bacterial order dynamics. (a) Dynamics of the proportions of Enterobacteriales (first row), Bacteroidales (second row), Sphingomonadales (third row) and Myxococcales (fourth row) in Subject A’s gut microbiota over time. The black circles are the posterior mean estimates, ΘG, from the temporal analysis. The filled regions show the 5 and 95% credible intervals. The semi-transparent circles depict the maximum likelihood estimates under the multinomial model. The orange curve is the LOWESS (α=0.05, which corresponds approximately to 20 days) estimate calculated from the maximum likelihood estimates. The time periods when the subject was abroad and suffered from diarrhea are illustrated using the three shaded rectangles. (b) As in (a) but in the case when only every second time point is considered. (c) As in (a) but in the case when only every third time point is considered
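As a rough illustration of the multinomial maximum likelihood estimates and the subsampling described in the caption of Fig. 3, the sketch below uses made-up read counts. The names `multinomial_mle` and `counts_by_time` are hypothetical, and the LOWESS smoothing step is not reproduced here.

```python
def multinomial_mle(counts):
    """Maximum likelihood proportions under a multinomial model:
    each taxon's count divided by the total count of the sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


# Hypothetical read counts: one row per time point, one column per taxon.
counts_by_time = [
    [120, 30, 0],
    [90, 45, 15],
    [100, 40, 10],
    [80, 60, 20],
]

mle_by_time = [multinomial_mle(row) for row in counts_by_time]

# Thinned series analogous to panels (b) and (c).
every_second = mle_by_time[::2]
every_third = mle_by_time[::3]
```

Note that a sampling zero (the third taxon at the first time point) yields an estimate of exactly zero under this model, which is one of the issues the abstract raises.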
Fig. 4.
Kinetics of Subject A’s gut microbiota. (a) Light gray and gray shaded regions are prior and posterior distributions of the length-scale parameter, respectively. The posterior distributions obtained in different analysis windows are illustrated separately (the days corresponding to each of the windows are listed in the titles). Posterior densities are estimated using Gaussian kernel density estimation (with Scott’s rule for bandwidth selection) on the pooled length-scale posterior samples over all the bacterial orders. (b) The posterior mean of the length-scale parameter and the corresponding standard deviations of Bifidobacteriales in different analysis windows (the window numbers correspond to the ones listed in (a)). (c) Dynamics of Bifidobacteriales in Subject A’s gut microbiota over time. The black circles are the posterior mean estimates, ΘG, from the temporal analysis. The filled regions show the 5 and 95% credible intervals. The semi-transparent circles depict the maximum likelihood estimates under the multinomial model. The time periods when the subject was abroad and suffered from diarrhea are illustrated using the three shaded rectangles
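The density estimation mentioned in panel (a) of Fig. 4 can be sketched in plain Python for the one-dimensional case. This is an assumption-laden illustration: the sample values are invented, the function names are hypothetical, and the bandwidth follows the common form of Scott's rule (sample standard deviation times n^(-1/5)), which may differ in detail from the software the authors used.

```python
import math
import statistics


def scott_bandwidth(samples):
    """Scott's rule in one dimension: sample std times n ** (-1/5)."""
    n = len(samples)
    return statistics.stdev(samples) * n ** (-1.0 / 5.0)


def gaussian_kde(samples, x):
    """Gaussian kernel density estimate evaluated at a single point x."""
    h = scott_bandwidth(samples)
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm


# Hypothetical pooled posterior samples of a length-scale parameter.
length_scale_samples = [1.0, 1.5, 2.0, 2.2, 3.1]

# Evaluate the estimated density on a small grid.
grid = [0.5 * i for i in range(0, 9)]
density = [gaussian_kde(length_scale_samples, x) for x in grid]
```

With real posterior draws the sample would be much larger, and the bandwidth would shrink accordingly because of the n^(-1/5) factor.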
