PLoS One. 2007 Nov 21;2(11):e1195.
doi: 10.1371/journal.pone.0001195.

Subclass mapping: identifying common subtypes in independent disease data sets

Yujin Hoshida et al. PLoS One. 2007.

Abstract

Whole genome expression profiles are widely used to discover molecular subtypes of diseases. A remaining challenge is to identify the correspondence or commonality of subtypes found in multiple, independent data sets generated on various platforms. While model-based supervised learning is often used to make these connections, the models can be biased to the training data set and thus miss inherent, relevant substructure in the test data. Here we describe an unsupervised subclass mapping method (SubMap), which reveals common subtypes between independent data sets. The subtypes within a data set can be determined by unsupervised clustering or given by predetermined phenotypes before applying SubMap. We define a measure of correspondence for subtypes and evaluate its significance building on our previous work on gene set enrichment analysis. The strength of the SubMap method is that it does not impose the structure of one data set upon another, but rather uses a bi-directional approach to highlight the common substructures in both. We show how this method can reveal the correspondence between several cancer-related data sets. Notably, it identifies common subtypes of breast cancer associated with estrogen receptor status, and a subgroup of lymphoma patients who share similar survival patterns, thus improving the accuracy of a clinical outcome predictor.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Subclass mapping (SubMap) methodology. Two independent data sets, A and B, are clustered separately, compared and integrated.
(a) Candidate subclasses are defined by clustering A and B (predetermined phenotypes can also be used). Marker genes of each candidate subclass in A (A_i) are selected and mapped onto a gene list ranked by differential expression with respect to a subclass of B (B_j). Their over-representation at the top of the ranking is evaluated using the enrichment score (ES_{AiBj}), and its significance is assessed as a nominal p-value, p_{AiBj}, by randomly permuting sample class labels in B. This process is repeated with the roles of A and B interchanged to compute ES_{BjAi} and p_{BjAi}. (b) The mutual enrichment p-values, p_{AiBj} and p_{BjAi}, are combined using the Fisher inverse chi-square statistic, F_{ij}. Its significance is estimated from a null distribution for F_{ij} generated by randomly drawing nominal p-values from the corresponding null distributions for ES_{AiBj} and ES_{BjAi}. After multiple hypothesis testing (MHT) correction, the p-values for F_{ij} are summarized in the subclass association (SA) matrix. Clustering of the SA matrix reveals subclasses common to A and B.
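The steps in panels (a) and (b) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the placeholder reverse-direction p-value are hypothetical, the ranking metric is a plain difference of class means chosen for brevity, and the enrichment score is an unweighted running sum that tracks only positive deviation. The estimation of the significance of the Fisher statistic from null distributions is not shown.

```python
import math
import random


def enrichment_score(ranked_genes, markers):
    """Unweighted running-sum score: maximum positive deviation from zero.

    Simplification (assumption): all hits share one step size, all misses another.
    """
    n = len(ranked_genes)
    n_hit = sum(1 for g in ranked_genes if g in markers)
    if n_hit == 0 or n_hit == n:
        return 0.0
    hit_step, miss_step = 1.0 / n_hit, 1.0 / (n - n_hit)
    running = best = 0.0
    for g in ranked_genes:
        running += hit_step if g in markers else -miss_step
        best = max(best, running)
    return best


def rank_genes(expr, labels, target):
    """Rank genes by mean(target samples) - mean(other samples), descending.

    expr maps gene name -> list of expression values, one per sample.
    The mean-difference metric is an assumption made for this sketch.
    """
    def score(values):
        inside = [v for v, lab in zip(values, labels) if lab == target]
        outside = [v for v, lab in zip(values, labels) if lab != target]
        return sum(inside) / len(inside) - sum(outside) / len(outside)
    return sorted(expr, key=lambda g: score(expr[g]), reverse=True)


def nominal_p(expr, labels, target, markers, n_perm=200, seed=0):
    """Permutation p-value for enrichment of the markers in the target class."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expr, labels, target), markers)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomly permute the sample class labels
        null_es = enrichment_score(rank_genes(expr, shuffled, target), markers)
        exceed += null_es >= observed
    return (exceed + 1) / (n_perm + 1)  # add-one correction keeps p above zero


def fisher_statistic(p_ab, p_ba):
    """Fisher inverse chi-square statistic combining two p-values."""
    return -2.0 * (math.log(p_ab) + math.log(p_ba))


# Toy data (hypothetical): 20 genes, 10 samples; g0..g4 are raised in class "B1".
toy_rng = random.Random(1)
labels_b = ["B1"] * 5 + ["B2"] * 5
expr_b = {
    f"g{k}": [toy_rng.gauss(2.0 if (k < 5 and lab == "B1") else 0.0, 1.0)
              for lab in labels_b]
    for k in range(20)
}
markers_a1 = {f"g{k}" for k in range(5)}  # stand-in for marker genes of a subclass A1
p_ab = nominal_p(expr_b, labels_b, "B1", markers_a1)
p_ba = 0.01  # placeholder for the reverse direction (markers of B1 ranked in A)
f_11 = fisher_statistic(p_ab, p_ba)
```

In a full analysis the reverse-direction p-value would be computed the same way with the roles of the two data sets swapped, and every (i, j) pair would fill one cell of the SA matrix.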
Figure 2. Example 1: Multiple tissue types.
(a) SubMap was applied to two data sets, Multi-A and Multi-B, containing multiple tissue types: breast (Br), prostate (Pr), lung (Lu), and colon (Co). Bonferroni-corrected p-values for breast, prostate, lung, and colon tissues were all 0.002. (b) Each tissue type in Multi-B was removed before applying SubMap. Only subsets of the same tissue type were significantly associated (Bonferroni-corrected p<0.05). The p-values for “Multi-A-Br and Multi-B-Lu (upper left)”, “Multi-A-Lu and Multi-B-Br (lower left)”, and “Multi-A-Co and Multi-B-Br (lower right)” are 0.330, 0.547, and 0.517, respectively.
Figure 3. Example 2: Common subtypes of Diffuse Large B-cell Lymphoma (DLBCL).
SubMap was applied to three subclasses of DLBCL predetermined in the DLBCL-A and DLBCL-B data sets. Bonferroni-corrected p-values for the “oxidative phosphorylation (OxPhos)”, “B-cell response (BCR)”, and “host response (HR)” subtypes were 0.008, 0.001, and 0.001, respectively. The association for the pair of DLBCL-A-BCR and DLBCL-B-OxPhos was not significant (p = 0.362).
Figure 4. Example 3: Common subtypes of breast cancer associated with estrogen receptor (ER) status.
(a) Candidate subclass labels were assigned using hierarchical clustering in Breast-A and Breast-B data sets independently. (b) Subclass association (SA) matrix for Breast-A and Breast-B. Bonferroni-corrected p-values for the combinations of “A1 and B2”, “A1 and B4”, “A2 and B1”, “A3 and B1”, and “A3 and B3” were 0.070, 0.002, 0.023, 0.001, and 0.055, respectively (FDR-corrected p-values of 0.014, 0.001, 0.008, 0.001, and 0.014, respectively). *: ER status is missing for one case.
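The two kinds of multiple hypothesis testing correction reported in this caption can be sketched as follows. This is an illustration only: the caption does not say which FDR procedure was used, so the Benjamini-Hochberg step-up adjustment below is an assumption, and the function names and input p-values are hypothetical rather than values from the paper.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four cells of an SA matrix.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
```

Bonferroni adjustment is never smaller than the Benjamini-Hochberg adjustment for the same input, which is consistent with the FDR-corrected values in the caption being lower than the Bonferroni-corrected ones.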
Figure 5. Example 4: Survival prediction in Diffuse Large B-cell lymphoma (DLBCL) data sets.
(a) Subclass association (SA) matrix for the comparison between DLBCL-C and DLBCL-D data sets. Bonferroni-corrected p-values for the pairs “C3 and D2” and “C4 and D3” were both 0.002. (b) Survival prediction models were built using DLBCL-C and applied to DLBCL-D. Kaplan-Meier survival curves for the predicted groups in DLBCL-D are shown. Left: The prediction model was trained using all cases in DLBCL-C (n = 58) and tested on all cases in DLBCL-D (n = 129). Middle: Survival prediction using only cases from “matched” subclasses. The model was trained using C3 ∪ C4 samples (n = 25) and tested in D2 ∪ D3 (n = 61). The survival separation was better than that in the left panel despite the smaller number of samples. Right: Survival prediction using only cases from “unmatched” subclasses. The model was trained using C1 ∪ C2 (n = 33) and tested in D1 ∪ D4 (n = 68). The numbers of events were 61, 22, and 39 for all (D1 ∪ D2 ∪ D3 ∪ D4), “matched” (D2 ∪ D3), and “unmatched” (D1 ∪ D4) patients, respectively. p-values were calculated using the log-rank test. DFS: disease-free survival.
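For readers unfamiliar with the log-rank test mentioned in this caption, a minimal two-group version is sketched below. It is a textbook formulation, not the authors' code; the function name and the follow-up times are hypothetical and do not come from the DLBCL data sets.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*  : follow-up time per patient
    events_* : 1 if the event was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected events in group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical follow-up times: group A fails early, group B fails late.
chi2_sep, p_sep = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
# Identical groups should show no difference.
chi2_same, p_same = logrank_test([1, 2, 3], [1] * 3, [1, 2, 3], [1] * 3)
```

The statistic compares observed with expected event counts at each event time, so it uses the whole survival curve rather than survival at a single time point.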
