Microb Biotechnol. 2021 Jul;14(4):1539-1549.
doi: 10.1111/1751-7915.13815. Epub 2021 May 21.

Combining comparative genomic analysis with machine learning reveals some promising diagnostic markers to identify five common pathogenic non-tuberculous mycobacteria

Xinmiao Jia et al. Microb Biotechnol. 2021 Jul.

Abstract

Non-tuberculous mycobacteria (NTM) can cause various respiratory diseases and, in severe cases, death, and the incidence of NTM infection has increased rapidly worldwide. To date, routine diagnostic methods and strain identification struggle to diagnose the various types of NTM infection precisely. We combined systematic comparative genomics with machine learning to select new diagnostic markers for precisely identifying five common pathogenic NTMs (Mycobacterium kansasii, Mycobacterium avium, Mycobacterium intracellulare, Mycobacterium chelonae, Mycobacterium abscessus). A panel of six genes and two SNPs (nikA, benM, codA, pfkA2, mpr, yjcH, rrl C2638T, rrl A1173G) was selected to identify the five NTMs simultaneously with high accuracy (> 90%). Notably, a panel containing only the six genes also classified well (accuracy > 90%). Additionally, both panels could precisely differentiate the five NTMs from M. tuberculosis (accuracy > 99%). We also identified new marker genes, SNPs and combinations that accurately discriminate any one of the five NTMs individually, making it possible to diagnose a specific NTM infection precisely. Our research not only reveals promising novel diagnostic markers to advance precision diagnosis of NTM infections, but also offers insight into precisely identifying genetically close pathogens through comparative genomics and machine learning.

Conflict of interest statement

The authors declare that they are inventors on various patent applications covering several of the methods and results reported here.

Figures

Fig. 1
Phylogenetic analysis of 123 NTM and 40 Mtb strains with complete genomes. The five common pathogenic NTMs are shown in different colours. The phylogenetic tree was constructed based on 289 751 core gene SNPs shared by these strains.
Fig. 2
Flower plot showing the core, dispensable, and species‐specific genes of the five NTM species. Mtb strains were included as controls. The flower plot displays the core gene cluster number (in the centre), the dispensable gene number (in the annulus), and the species‐specific gene number (in the petals) of the five NTM species. The number under each species name denotes the core gene number of that species.
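The three categories in the flower plot can be illustrated with plain set operations. This is a minimal sketch, not the authors' pipeline: it assumes that "core" means a gene cluster present in every species, "species-specific" means present in exactly one species, and "dispensable" means everything in between. The function name `partition_pan_genome` and the cluster names are hypothetical placeholders.

```python
from collections import Counter


def partition_pan_genome(genes_by_species):
    """Split gene clusters into core, dispensable, and species-specific sets.

    genes_by_species maps a species label to the set of gene clusters found
    in that species (assumed definitions, see the text above).
    """
    n_species = len(genes_by_species)
    # Count in how many species each gene cluster occurs.
    occurrence = Counter(g for genes in genes_by_species.values() for g in genes)

    core = {g for g, n in occurrence.items() if n == n_species}
    dispensable = {g for g, n in occurrence.items() if 1 < n < n_species}
    specific = {
        species: {g for g in genes if occurrence[g] == 1}
        for species, genes in genes_by_species.items()
    }
    return core, dispensable, specific


# Hypothetical toy input: cluster names are placeholders, not data from the paper.
toy = {
    "Mka": {"c1", "c2", "c3"},
    "Mav": {"c1", "c2", "c4"},
    "Mab": {"c1", "c5"},
}
core, dispensable, specific = partition_pan_genome(toy)
```

In this toy input, one cluster is shared by all three species, one is shared by two, and each species has one cluster of its own, which mirrors the centre, annulus, and petals of the plot.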
Fig. 3
Importance of the optimized gene/SNP combinations to identify the five common pathogenic NTM species using random forest models. A. Mean Decrease Gini coefficient of the optimized gene combinations. B. Mean Decrease Accuracy of the optimized gene combinations. C. Mean Decrease Gini coefficient of the optimized SNP combinations. D. Mean Decrease Accuracy of the optimized SNP combinations.
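The two importance measures named in this caption can be illustrated on a toy presence/absence matrix. The sketch below is a simplification under stated assumptions: it computes the Gini impurity decrease for a single split and the accuracy drop for a single shuffle, whereas a random forest averages such quantities over many trees and out-of-bag samples. The data, the `predict` rule, and all function names are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(feature, labels):
    """Impurity decrease from splitting the samples on a binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def accuracy_decrease(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling the values of one feature column."""
    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    shuffled = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(rows) - accuracy(permuted)


# Hypothetical toy data: column 0 is a perfect marker, column 1 is uninformative.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["Mka", "Mka", "non-Mka", "non-Mka"]


def predict(row):
    """Stand-in classifier that only looks at column 0."""
    return "Mka" if row[0] == 1 else "non-Mka"


gini_marker = gini_decrease([r[0] for r in rows], labels)
gini_noise = gini_decrease([r[1] for r in rows], labels)
acc_noise = accuracy_decrease(predict, rows, labels, column=1)
```

A marker that separates the classes cleanly removes all impurity at the split, while an uninformative column removes none and can be shuffled without affecting accuracy.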
Fig. 4
Ensemble classification workflow for data generation and analysis of gene/SNP panels for simultaneously discriminating the five common pathogenic NTM species. The multiclass classifier is based on the RF binary classifiers above (‘non‐Mka’ indicates the Mav, Min, Mch, Mab and Mtb strains; ‘non‐Mav’ indicates the Mka, Min, Mch, Mab and Mtb strains; ‘non‐Min’ indicates the Mka, Mav, Mch, Mab and Mtb strains; ‘non‐Mch’ indicates the Mka, Mav, Min, Mab and Mtb strains; ‘non‐Mab’ indicates the Mka, Mav, Min, Mch and Mtb strains). Confusion matrices of the gene/SNP panels in the training set and test set are shown on the left and right, respectively.
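One way to picture how five "species vs non-species" binary classifiers could yield a single call is sketched below. The combination rule is an assumption made for illustration, since the caption does not state it: the species with the highest positive-class score wins if that score reaches a threshold, and otherwise the sample falls back to Mtb, the only class that belongs to every "non-" group. The scores, the threshold, and the function names are hypothetical.

```python
SPECIES = ["Mka", "Mav", "Min", "Mch", "Mab"]


def combine_one_vs_rest(scores, threshold=0.5, fallback="Mtb"):
    """Turn per-species binary scores into a single call.

    scores maps each species label to the positive-class score from its
    binary classifier. Assumed rule: highest score wins if it reaches the
    threshold; otherwise return the fallback class.
    """
    best = max(SPECIES, key=lambda s: scores[s])
    return best if scores[best] >= threshold else fallback


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count (true, predicted) pairs; rows are true classes, columns are calls."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical scores for three samples; not results from the paper.
toy_scores = [
    {"Mka": 0.9, "Mav": 0.1, "Min": 0.2, "Mch": 0.0, "Mab": 0.1},
    {"Mka": 0.1, "Mav": 0.6, "Min": 0.7, "Mch": 0.0, "Mab": 0.1},
    {"Mka": 0.1, "Mav": 0.2, "Min": 0.1, "Mch": 0.0, "Mab": 0.3},
]
true_labels = ["Mka", "Mav", "Mtb"]
calls = [combine_one_vs_rest(s) for s in toy_scores]
classes = SPECIES + ["Mtb"]
matrix = confusion_matrix(true_labels, calls, classes)
```

The second toy sample shows why a confusion matrix is useful here: two binary classifiers both score above the threshold, the rule picks the higher one, and the resulting Mav-to-Min error appears off the diagonal.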

