Lancet Respir Med. 2016 Mar;4(3):213-24.
doi: 10.1016/S2213-2600(16)00048-5. Epub 2016 Feb 20.

Genome-wide Expression for Diagnosis of Pulmonary Tuberculosis: A Multicohort Analysis

Free PMC article

Timothy E Sweeney et al. Lancet Respir Med.


Background: Active pulmonary tuberculosis is difficult to diagnose and treatment response is difficult to effectively monitor. A WHO consensus statement has called for new non-sputum diagnostics. The aim of this study was to use an integrated multicohort analysis of samples from publicly available datasets to derive a diagnostic gene set in the peripheral blood of patients with active tuberculosis.

Methods: We searched two public gene expression microarray repositories and retained datasets that examined clinical cohorts of active pulmonary tuberculosis infection in whole blood. We compared gene expression in patients with either latent tuberculosis or other diseases versus patients with active tuberculosis using our validated multicohort analysis framework. Three datasets were used as discovery datasets and meta-analytical methods were used to assess gene effects in these cohorts. We then validated the diagnostic capacity of the three-gene set in the remaining 11 datasets.
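As an illustration of how meta-analytical methods can combine a gene's effect across cohorts, the following is a minimal sketch that computes a standardised mean difference (Hedges' g) per cohort and pools the cohorts with a DerSimonian-Laird random-effects model. The choice of Hedges' g and of DerSimonian-Laird is an assumption made for illustration; the abstract does not specify which estimators the authors' framework uses, and the function names are hypothetical.

```python
import math


def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardised mean difference (group A minus group B) and its SE.

    Assumed estimator: Hedges' g with the usual small-sample correction.
    """
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df)
    d = (mean_a - mean_b) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    var_d = (n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b))
    return j * d, math.sqrt(j ** 2 * var_d)


def pool_random_effects(effects):
    """Pool a list of (effect, se) pairs with DerSimonian-Laird weights.

    Returns (pooled_effect, pooled_se, (ci_low, ci_high)) with a 95% CI.
    """
    w = [1 / se ** 2 for _, se in effects]
    g = [e for e, _ in effects]
    fixed = sum(wi * gi for wi, gi in zip(w, g)) / sum(w)
    # Cochran's Q measures between-cohort heterogeneity
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, g))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (se ** 2 + tau2) for _, se in effects]
    pooled = sum(wi * gi for wi, gi in zip(w_star, g)) / sum(w_star)
    pooled_se = math.sqrt(1 / sum(w_star))
    return pooled, pooled_se, (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
```

Each discovery cohort would contribute one `(effect, se)` pair per gene, and the pooled estimate with its 95% CI corresponds to the summary diamond in a forest plot.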

Findings: A total of 14 datasets containing 2572 samples from 10 countries from both adult and paediatric patients were included in the analysis. Of these, three datasets (N=1023) were used to discover a set of three genes (GBP5, DUSP3, and KLF2) that are highly diagnostic for active tuberculosis. We validated the diagnostic power of the three-gene set to separate active tuberculosis from healthy controls (global area under the ROC curve (AUC) 0·90 [95% CI 0·85-0·95]), latent tuberculosis (0·88 [0·84-0·92]), and other diseases (0·84 [0·80-0·95]) in eight independent datasets composed of both children and adults from ten countries. Expression of the three-gene set was not confounded by HIV infection status, bacterial drug resistance, or BCG vaccination. Furthermore, in four additional cohorts, we showed that the tuberculosis score declined during treatment of patients with active tuberculosis.
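For readers unfamiliar with the AUC values reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from per-sample scores. It uses the standard equivalence between the AUC and the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function name and the example scores are hypothetical and are not data from the study.

```python
def roc_auc(case_scores, control_scores):
    """Area under the ROC curve from two lists of scores.

    Counts, over all case/control pairs, how often the case scores
    higher than the control (ties count as half a win).
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores, for illustration only
active_tb = [2.0, 3.0]
controls = [1.0, 2.5]
print(roc_auc(active_tb, controls))  # 0.75
```

An AUC of 0.5 corresponds to a score that separates the groups no better than chance, and 1.0 to perfect separation.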

Interpretation: Overall, our integrated multicohort analysis yielded a three-gene set in whole blood that is robustly diagnostic for active tuberculosis, that was validated in multiple independent cohorts, and that has potential clinical application for diagnosis and monitoring treatment response. Prospective laboratory validation will be required before it can be used in a clinical setting.

Funding: National Institute of Allergy and Infectious Diseases, National Library of Medicine, the Stanford Child Health Research Institute, the Society for University Surgeons, and the Bill and Melinda Gates Foundation.


Figure 1. Multicohort analysis
Schematic of the multicohort analysis workflow
Figure 2. Forest plots for each of the three genes derived in the forward search
The x axis represents the standardised mean difference between latent tuberculosis and other diseases versus active tuberculosis. The size of the blue rectangles is inversely proportional to the SE of the mean in the study. Whiskers represent the 95% CI. The orange diamonds represent the overall, combined mean difference for a given gene. The width of the diamonds represents the 95% CI of the overall combined mean difference. FDR=false discovery rate.
Figure 3. Performance of the three-gene set in the discovery datasets
ROC curves in discovery cohorts showing healthy controls (A), patients with latent tuberculosis (B), and patients with other diseases (C) versus patients with active tuberculosis. Healthy patients were not included in the multicohort analysis but are shown here. ROC curves in four validation cohorts comparing healthy controls with active tuberculosis (D), patients with latent tuberculosis with patients with active tuberculosis, and ROC curves in three validation cohorts comparing patients with other diseases with active tuberculosis (E). Violin plots with patient-level data are shown in figure 6 and appendix pp 3, 5, 6. ROC=receiver operating characteristic. AUC=area under the curve.
Figure 4. Establishment of a single global test cutoff in the validation datasets
Sample-level normalised gene scores and group tuberculosis score distributions. Cohorts are shown. Bars within violin plots show IQR; white dashes show medians. By centering the genes within each dataset to their global mean, a single cutoff across multiple datasets can be established.
Figure 5. Effect of HIV co-infection on the diagnostic power of the tuberculosis score
In GSE37250, GSE39939, and GSE39940, no significant difference was noted in the diagnostic power for other diseases versus active tuberculosis based on HIV status. In GSE37250, there was a decrease in ROC AUC from 0·96 to 0·89 in latent tuberculosis versus active tuberculosis in HIV-positive patients. ROC=receiver operating characteristic. AUC=area under the curve.
Figure 6. Violin plots showing the performance of the three-gene set in longitudinal validation datasets
Four validation datasets examined active tuberculosis patients during treatment and recovery. All four datasets took samples before and during treatment. The tuberculosis score falls over time of treatment. GSE56153 also included healthy controls; the tuberculosis score returned to normal after treatment (Wilcoxon p=not significant between cured cases and healthy controls; C). GSE62147 also examined active Mycobacterium africanum infections (D).
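
The centering step described for figure 4 can be sketched as follows. This is one possible reading of "centering the genes within each dataset to their global mean": each gene in each dataset is shifted so that its dataset mean equals the mean over all pooled samples. The score formula used here (mean of GBP5 and DUSP3 minus KLF2) is an assumption, since the abstract does not state how the tuberculosis score is calculated; all function names and the cutoff parameter are hypothetical.

```python
from statistics import mean

GENES = ("GBP5", "DUSP3", "KLF2")


def center_to_global_mean(datasets):
    """Shift each gene in each dataset so its mean matches the global mean.

    datasets: dict mapping dataset name -> list of samples, where each
    sample is a dict mapping gene name -> log-scale expression value.
    """
    all_samples = [s for samples in datasets.values() for s in samples]
    global_mean = {g: mean(s[g] for s in all_samples) for g in GENES}
    centered = {}
    for name, samples in datasets.items():
        # Per-dataset offset that removes the dataset-level shift
        shift = {g: global_mean[g] - mean(s[g] for s in samples) for g in GENES}
        centered[name] = [{g: s[g] + shift[g] for g in GENES} for s in samples]
    return centered


def tb_score(sample):
    """Assumed score: mean of GBP5 and DUSP3 minus KLF2."""
    return (sample["GBP5"] + sample["DUSP3"]) / 2 - sample["KLF2"]


def is_positive(sample, cutoff):
    """Apply one global cutoff to a centered sample (cutoff is hypothetical)."""
    return tb_score(sample) > cutoff
```

After centering, scores from different datasets sit on a comparable scale, which is what allows a single threshold to be applied across cohorts rather than one threshold per dataset.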
