TphPMF: A microbiome data imputation method using hierarchical Bayesian Probabilistic Matrix Factorization

PLoS Comput Biol. 2025 Mar 11;21(3):e1012858. doi: 10.1371/journal.pcbi.1012858. eCollection 2025 Mar.

Abstract

In microbiome research, data sparsity represents a prevalent and formidable challenge. Sparse data not only compromises the accuracy of statistical analyses but also conceals critical biological relationships, thereby undermining the reliability of the conclusions. To tackle this issue, we introduce a machine learning approach for microbiome data imputation, termed TphPMF. This technique leverages Probabilistic Matrix Factorization, incorporating phylogenetic relationships among microorganisms to establish Bayesian prior distributions. These priors facilitate posterior predictions of potential non-biological zeros. We demonstrate that TphPMF outperforms existing microbiome data imputation methods in accurately recovering missing taxon abundances. Furthermore, TphPMF enhances the efficacy of certain differential abundance analysis methods in detecting differentially abundant (DA) taxa, particularly showing advantages when used in conjunction with DESeq2-phyloseq. Additionally, TphPMF significantly improves the precision of cross-predicting disease conditions in microbiome datasets pertaining to type 2 diabetes and colorectal cancer.
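
The abstract does not give the model equations, so the following is only a minimal sketch of the general idea, not the TphPMF implementation. It assumes that abundances have already been log-transformed, that suspected non-biological zeros are marked as unobserved, and that a single level of clade labels (the hypothetical `groups` argument) stands in for the phylogeny. It also replaces full Bayesian posterior inference with a simpler MAP estimate obtained by alternating ridge regressions, in which each taxon's latent factor is shrunk toward the mean factor of its clade. The function name `pmf_impute` and all parameters are illustrative.

```python
import numpy as np

def pmf_impute(X, observed, groups, rank=2, lam=0.1, n_iter=50, seed=0):
    """MAP matrix factorization with clade-level prior means (illustrative).

    X        : (n_samples, n_taxa) array, e.g. log-transformed abundances
    observed : boolean mask, True where the entry is trusted
    groups   : (n_taxa,) integer clade labels, a stand-in for the phylogeny
    """
    X = np.asarray(X, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, rank))  # sample factors
    V = 0.1 * rng.standard_normal((m, rank))  # taxon factors
    reg = lam * np.eye(rank)
    for _ in range(n_iter):
        # Prior mean of each taxon factor: the mean factor of its clade.
        mu = np.zeros_like(V)
        for g in np.unique(groups):
            mu[groups == g] = V[groups == g].mean(axis=0)
        # Sample factors: ridge regression on observed entries, zero-mean prior.
        for i in range(n):
            o = observed[i]
            Vo = V[o]
            U[i] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ X[i, o])
        # Taxon factors: ridge regression shrunk toward the clade mean.
        for j in range(m):
            o = observed[:, j]
            Uo = U[o]
            V[j] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ X[o, j] + lam * mu[j])
    # Keep trusted entries as they are; fill the rest with model predictions.
    filled = X.copy()
    filled[~observed] = (U @ V.T)[~observed]
    return filled
```

In this sketch, `lam` controls how strongly a sparsely observed taxon borrows information from its relatives: larger values pull its factor closer to the clade mean. A hierarchical Bayesian treatment would instead place priors at several levels of the tree and average predictions over posterior samples.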

MeSH terms

  • Algorithms
  • Bayes Theorem
  • Colorectal Neoplasms / microbiology
  • Computational Biology / methods
  • Diabetes Mellitus, Type 2 / microbiology
  • Humans
  • Machine Learning
  • Microbiota* / genetics
  • Phylogeny