Robust and automatic definition of microbiome states

PeerJ. 2019 Mar 26;7:e6657. doi: 10.7717/peerj.6657. eCollection 2019.

Abstract

Analysis of microbiome dynamics would allow elucidation of patterns within microbial community evolution under a variety of biologically or economically important circumstances; however, this is currently hampered in part by the lack of rigorous, formal, yet generally applicable approaches to discerning distinct configurations of complex microbial populations. Clustering approaches to define microbiome "community state-types" at a population scale are widely used, though not yet standardized. Similarly, distinct variations within a state-type are well documented, but there is no rigorous approach to discriminating these more subtle variations in community structure. Finally, intra-individual variations with even fewer differences will likely be found in, for example, longitudinal data, and will correlate with important features such as sickness versus health. We propose an automated, generic, objective, domain-independent, and internally validating procedure to define statistically distinct microbiome states within datasets containing any degree of phylotypic diversity. Robustness of state identification is objectively established by a combination of diverse techniques for stable cluster verification. To demonstrate the efficacy of our approach in detecting discrete states even in datasets containing highly similar bacterial communities, and to demonstrate the broad applicability of our method, we reuse eight distinct longitudinal microbiome datasets from a variety of ecological niches and species. We also demonstrate our algorithm's flexibility by providing it distinct taxa subsets as clustering input, demonstrating that it operates on filtered or unfiltered data, and at a range of different taxonomic levels. The final output is a set of robustly defined states which can then be used as general biomarkers for a wide variety of downstream purposes such as association with disease, monitoring response to intervention, or identifying optimally performant populations.
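As an illustration only, the general idea of defining states by clustering and then checking cluster quality can be sketched as follows. The abstract does not name the distance measure, clustering algorithm, or verification techniques used, so everything here is an assumption: Bray-Curtis dissimilarity, a simple k-medoids routine with farthest-first initialisation, and mean silhouette width as the only quality criterion. The function names (`bray_curtis`, `k_medoids`, `mean_silhouette`, `define_states`) and the toy abundance profiles are hypothetical, and the additional stability checks the abstract alludes to are omitted.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0

def assign(dist, medoids):
    """Label each sample with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (assumed method)."""
    n = len(dist)
    # Farthest-first initialisation: deterministic, no random restarts needed.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids)

def mean_silhouette(dist, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    n = len(labels)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b) if max(a, b) else 0.0
    return total / n

def define_states(samples, k_range=range(2, 6)):
    """Return (score, number_of_states, labels) for the best-scoring k."""
    dist = [[bray_curtis(a, b) for b in samples] for a in samples]
    best = None
    for k in k_range:
        if k >= len(samples):
            break
        labels = k_medoids(dist, k)
        score = mean_silhouette(dist, labels)
        if best is None or score > best[0]:
            best = (score, len(set(labels)), labels)
    return best

# Toy relative-abundance profiles (three taxa, made-up numbers).
toy_samples = [
    [80, 15, 5], [78, 17, 5], [82, 13, 5], [79, 16, 5],
    [10, 20, 70], [12, 18, 70], [8, 22, 70], [11, 19, 70],
]
score, n_states, state_labels = define_states(toy_samples)
print(n_states, state_labels)
```

On the toy profiles the two visibly different groups of samples end up in separate states. A procedure aiming at the robustness described in the abstract would combine a score like this with further, independent stability checks rather than rely on a single criterion.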

Keywords: Clustering; Longitudinal dataset; Machine Learning; Metagenomics; Microbiome; Sub-states.

Grants and funding

Mark D. Wilkinson is funded by the Ministerio de Economía y Competitividad grant number TIN2014-55993-RM, and by the Isaac Peral programme of UPM. Beatriz García-Jiménez is funded through an award from the Severo Ochoa programme of the CBGP UPM-INIA Severo Ochoa Center of Excellence, Madrid. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.