The accuracy of absolute differential abundance analysis from relative count data

PLoS Comput Biol. 2022 Jul 11;18(7):e1010284. doi: 10.1371/journal.pcbi.1010284. eCollection 2022 Jul.

Abstract

Concerns have been raised about the use of relative abundance data derived from next-generation sequencing as a proxy for absolute abundances. For example, in the differential abundance setting, compositional effects in relative abundance data may give rise to spurious differences (false positives) when considered from the absolute perspective. In practice, however, relative abundances are often transformed by renormalization strategies intended to compensate for these effects, and the scope of the practical problem remains unclear. We used simulated data to explore the consistency of differential abundance calling on renormalized relative abundances versus absolute abundances and find that, while overall consistency is high, with a median sensitivity (true positive rate) of 0.91 and a median specificity (1 - false positive rate) of 0.89, consistency can be much lower where there is widespread change in the abundance of features across conditions. We confirm these findings on a large number of real data sets drawn from 16S metabarcoding, expression array, bulk RNA-seq, and single-cell RNA-seq experiments, where the data sets with the greatest change between experimental conditions are also those with the highest false positive rates. Finally, we evaluate the predictive utility of summary features of the relative abundance data themselves. Estimates of sparsity and of the prevalence of feature-level change in relative abundance data give reasonable predictions of discrepancy in differential abundance calling in simulated data and can provide useful bounds for worst-case outcomes in real data.
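
The compositional effect and the two consistency metrics named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy counts, the helper names `to_relative`, `is_differential`, and `sensitivity_specificity`, and the simple fold-change rule that stands in for a real differential abundance test. It is not the study's simulation or pipeline, and it applies no renormalization. Calls made on absolute abundances are treated as the reference.

```python
def to_relative(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def is_differential(before, after, fold=1.5):
    """Toy rule (an assumption, not a statistical test): flag a feature
    whose abundance changes by more than `fold` in either direction."""
    ratio = after / before
    return ratio > fold or ratio < 1 / fold


def sensitivity_specificity(reference_calls, test_calls):
    """Score test calls against reference calls (parallel lists of bools)."""
    pairs = list(zip(reference_calls, test_calls))
    tp = sum(r and t for r, t in pairs)
    fn = sum(r and not t for r, t in pairs)
    tn = sum(not r and not t for r, t in pairs)
    fp = sum(not r and t for r, t in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical absolute abundances for four features in two conditions.
# Only feature 0 truly changes (a 10-fold increase).
absolute_a = [100, 100, 100, 100]
absolute_b = [1000, 100, 100, 100]

# Relative abundances, as obtained when only proportions are observed.
relative_a = to_relative(absolute_a)
relative_b = to_relative(absolute_b)

absolute_calls = [is_differential(a, b) for a, b in zip(absolute_a, absolute_b)]
relative_calls = [is_differential(a, b) for a, b in zip(relative_a, relative_b)]

print(absolute_calls)  # [True, False, False, False]
print(relative_calls)  # [True, True, True, True]
print(sensitivity_specificity(absolute_calls, relative_calls))  # (1.0, 0.0)
```

In this toy case the three unchanged features appear to decrease in relative terms because feature 0 takes up a larger share of the total, so sensitivity is 1.0 while specificity is 0.0. Renormalization strategies of the kind mentioned above are intended to reduce exactly this kind of spurious call.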

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, Non-U.S. Gov't

MeSH terms

  • High-Throughput Nucleotide Sequencing*
  • RNA-Seq

Grants and funding

The authors would like to acknowledge funding from Human Frontier Science Program grant HFSP RGP005 (to SM), National Science Foundation grants NSF DMS 17-13012 (to SM), NSF BCS 1552848 (to SM), NSF DBI 1661386 (to SM), NSF IIS 15-46331 (to SM), and NSF DMS 16-13261 (to SM), as well as funding from the North Carolina Biotechnology Center through grants 2016-IDG-1013 (to KR, SM) and 2020-IIG-2109 (to KR, SM). The authors would also like to acknowledge funding through a graduate fellowship provided by the Duke Forge health data science center (to KR). No funders played any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.