Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis

Biostatistics. 2019 Oct 1;20(4):698-713. doi: 10.1093/biostatistics/kxy025.

Abstract

There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and to link these abundances to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power to detect such associations. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. In this article, we develop a new probability distribution, the zero-inflated generalized Dirichlet multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status) and develop a fast expectation-maximization algorithm to estimate the model parameters. The derived tests enable us to reveal rich patterns of variation in microbial compositions, including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
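
As a rough illustration of the kind of hierarchy the name ZIGDM suggests, the following Python sketch draws one vector of taxon counts. It is not taken from the article: it assumes the usual stick-breaking construction of the generalized Dirichlet (independent Beta fractions), assumes zero inflation acts by setting a fraction to exactly zero with some probability, and uses a hypothetical function name `sample_zigdm` with hypothetical parameters `a`, `b`, and `pi`. It covers only the sampling side, not the regression model, the expectation-maximization algorithm, or the tests.

```python
import numpy as np

def sample_zigdm(n_reads, a, b, pi, rng):
    """Draw one count vector over K+1 taxa from an assumed ZIGDM-style hierarchy.

    a, b : Beta shape parameters for the K stick-breaking fractions (assumed).
    pi   : per-fraction probability of a structural zero (assumed).
    """
    a, b, pi = (np.asarray(x, dtype=float) for x in (a, b, pi))
    # Stick-breaking fractions of the generalized Dirichlet.
    z = rng.beta(a, b)
    # Zero inflation: with probability pi_j the fraction is exactly zero.
    z = np.where(rng.random(a.shape) < pi, 0.0, z)
    # remaining[j] is the share of the stick left before fraction j is taken.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - z)))
    # Proportions for the K+1 taxa; the last taxon takes whatever is left.
    p = np.append(z * remaining[:-1], remaining[-1])
    # Observed counts given the sequencing depth and the latent proportions.
    return rng.multinomial(n_reads, p)

rng = np.random.default_rng(0)
counts = sample_zigdm(1000, a=[1, 2, 3], b=[3, 2, 1], pi=[1.0, 0.0, 0.0], rng=rng)
```

In this example the first taxon has `pi = 1.0`, so its proportion is forced to zero and its count is always zero, while the counts across all four taxa sum to the sequencing depth of 1000.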

Keywords: Compositional data analysis; Differential abundance; Hierarchical model; Microbiome; Score test; Zero-inflated model.

MeSH terms

  • Biostatistics / methods*
  • Data Analysis*
  • Humans
  • Microbiota*
  • Models, Statistical*