Background: Advances in DNA sequencing have given researchers an unprecedented opportunity to study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, many species are rare, so the data often contain many zeros. Third, the specimen is only a fraction of the microbial ecosystem, so the data are compositional and carry only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, existing methods for normalizing microbiome data have limitations.
Results: Assuming a multinomial distribution for the read counts and a prior on the unknown proportions, we propose an empirical Bayes approach to normalizing microbiome data. Using a tree-based extension of the Dirichlet prior, we extend the method to incorporate the phylogenetic tree into the normalization process. We then study the impact of normalization on differential abundance analysis and, when a tree structure is available, propose a phylogeny-aware procedure for detecting differentially abundant taxa.
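As a rough illustration of the idea, the following minimal sketch normalizes read counts by their posterior mean proportions. It assumes the prior is a plain Dirichlet (the text names the Dirichlet only for the tree-based extension) and does not use the phylogenetic tree. The function names `estimate_prior` and `normalize`, the toy counts, and the fixed `precision` value are hypothetical choices made for illustration; a full empirical Bayes procedure would presumably estimate the prior strength from the data as well.

```python
# Sketch only: posterior-mean normalization under a multinomial likelihood
# with a plain Dirichlet prior. Function names and the fixed `precision`
# are illustrative assumptions, not the authors' implementation.


def estimate_prior(counts, precision):
    """Build Dirichlet parameters alpha = precision * pooled proportions.

    The prior mean is estimated from the pooled data (the empirical Bayes
    step). `precision` is a hypothetical fixed value standing in for a
    data-driven estimate of the prior strength.
    """
    n_taxa = len(counts[0])
    pooled = [sum(row[j] for row in counts) for j in range(n_taxa)]
    total = sum(pooled)
    return [precision * p / total for p in pooled]


def normalize(counts, alpha):
    """Return posterior mean proportions (x_ij + alpha_j) / (N_i + sum(alpha))."""
    alpha_sum = sum(alpha)
    result = []
    for row in counts:
        depth = sum(row)  # sequencing depth N_i of this sample
        result.append([(x + a) / (depth + alpha_sum) for x, a in zip(row, alpha)])
    return result


# Toy data: two samples with very different depths, three taxa.
counts = [[10, 0, 90], [500, 20, 480]]
alpha = estimate_prior(counts, precision=2.0)
normalized = normalize(counts, alpha)
for row in normalized:
    print([round(p, 4) for p in row])
```

In this sketch each normalized sample sums to one regardless of its sequencing depth, an observed zero is replaced by a small positive proportion whenever the taxon appears in some other sample, and the shallow sample is shrunk toward the pooled proportions more strongly than the deep one.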
Conclusions: Extensive simulations and applications to gut microbiome data demonstrate that our empirical Bayes method outperforms other normalization methods as well as commonly used methods for differential abundance testing. The original R scripts are available on GitHub (https://github.com/liudoubletian/eBay).
Keywords: Bayesian shrinkage; Differentially abundant OTUs; MetagenomeSeq; Phylogeny-aware analysis; Rarefying.