Analysis of the limited M. tuberculosis accessory genome reveals potential pitfalls of pan-genome analysis approaches

bioRxiv [Preprint]. 2024 Mar 25:2024.03.21.586149. doi: 10.1101/2024.03.21.586149.

Abstract

Pan-genome analysis is a fundamental tool in the study of bacterial genome evolution. Benchmarking the accuracy of pan-genome analysis methods is challenging, because that accuracy can be significantly influenced by both the methodology used to compare genomes and differences in the accuracy and representativeness of the genomes analyzed. In this work, we curated a collection of 151 Mycobacterium tuberculosis (Mtb) isolates to evaluate sources of variability in pan-genome analysis. Mtb is characterized by its clonal evolution, absence of horizontal gene transfer, and limited accessory genome, making it an ideal test case for this study. Using a state-of-the-art graph-genome approach, we found that a majority of the structural variation observed in Mtb originates from rearrangement, deletion, and duplication of redundant nucleotide sequences. In contrast, we found that pan-genome analyses that focus on comparison of coding sequences (at the amino acid level) can yield surprisingly variable results, driven by differences in assembly quality and the software used. Upon closer inspection, we found that coding sequence annotation discrepancies were a major contributor to inflated Mtb accessory genome estimates. To address this, we developed panqc, a software tool that detects annotation discrepancies and collapses nucleotide redundancy in pan-genome estimates. We characterized the effect of the panqc adjustment on pan-genome analyses of both Mtb and E. coli genomes, and highlight how different levels of genomic diversity are prone to unique biases. Overall, this study illustrates the need for careful methodological selection and quality control to accurately map the evolutionary dynamics of a bacterial species.

Publication types

  • Preprint