Nucleic Acids Res. 2017 Jul 27;45(13):e122.
doi: 10.1093/nar/gkx338.

MiSTIC, an Integrated Platform for the Analysis of Heterogeneity in Large Tumour Transcriptome Datasets

Free PMC article

Sebastien Lemieux et al. Nucleic Acids Res.

Abstract

Genome-wide transcriptome profiling has enabled non-supervised classification of tumours, revealing different sub-groups characterized by specific gene expression features. However, the biological significance of these subtypes remains for the most part unclear. We describe herein an interactive platform, Minimum Spanning Trees Inferred Clustering (MiSTIC), that integrates the direct visualization and comparison of the gene correlation structure between datasets, the analysis of the molecular causes underlying co-variations in gene expression in cancer samples, and the clinical annotation of tumour sets defined by the combined expression of selected biomarkers. We have used MiSTIC to highlight the roles of specific transcription factors in breast cancer subtype specification, to compare the aspects of tumour heterogeneity targeted by different prognostic signatures, and to highlight biomarker interactions in AML. A version of MiSTIC preloaded with datasets described herein can be accessed through a public web server (http://mistic.iric.ca); in addition, the MiSTIC software package can be obtained (github.com/iric-soft/MiSTIC) for local use with personalized datasets.
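The idea of clustering genes through a minimum spanning tree built on expression correlations can be illustrated with a minimal sketch. This is not the MiSTIC implementation: the function names (`pearson`, `correlation_mst`), the toy expression values and the choice of Kruskal's algorithm are assumptions made for illustration only. Edges are added from the highest Pearson correlation downwards, so the correlation attached to each edge is the similarity threshold at which two groups of genes merge.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_mst(expr):
    """Build a spanning tree that favours highly correlated gene pairs.

    expr maps a gene name to its expression values across samples.
    Returns a list of (gene_a, gene_b, r) edges in the order in which
    Kruskal's algorithm accepts them (highest correlation first).
    """
    genes = list(expr)
    edges = sorted(
        ((pearson(expr[a], expr[b]), a, b) for a, b in combinations(genes, 2)),
        reverse=True,
    )
    parent = {g: g for g in genes}  # union-find forest

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    tree = []
    for r, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two separate clusters
            parent[root_a] = root_b
            tree.append((a, b, r))
    return tree


# Hypothetical toy data: G1/G2 co-vary, G3/G4 co-vary.
toy_expr = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "G3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "G4": [5.2, 0.9, 4.1, 2.2, 2.8],
}
for gene_a, gene_b, r in correlation_mst(toy_expr):
    print(f"{gene_a} - {gene_b}: r = {r:.3f}")
```

In this toy case the two tightly correlated pairs are linked first, and the last edge joins the two pairs at a much lower correlation, which is the kind of separation that a deep crevice between peaks would represent.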

Figures

Figure 1.
Visualization of gene expression correlations at different levels of resolution in MiSTIC. (A) Conversion from a dendrogram representation (top) to a classical icicle (bottom). Only clusters of size 3 and above are shown in the icicle. The width and height of peaks indicate the cluster size and the similarity threshold at which it forms. A deep crevice between adjoining peaks indicates a lack of gene correlation between peaks. (B) The icicle is transformed using a power-scale for the similarity measure and circularizing the original plot, the angle corresponding to genes and the radius to similarity measures. The angle at which peaks emerge from the structure reflects the arbitrary ordering of the genes/clusters in the dendrogram. Radius values represent Pearson correlations. (C) Clicking on a peak (arrow in B) generates a graph representation of the corresponding cluster. (D) Selecting two nodes in the cluster (orange labels) and clicking on the scatterplot tab generates a scatterplot representation of samples according to levels of expression of selected genes.
Figure 2.
Effect of sample size on icicle representation. Datasets with reduced sample size were obtained by resampling for both a subset of the Leucegene AML dataset (69 samples) and of the TCGA breast tumour dataset (754 samples). For each sample size explored, 50 datasets were prepared and correlation matrices built. (A) Distribution of correlation coefficients for one resampled dataset per sample size. Solid lines are used for resampled datasets derived from Leucegene AML and dashed lines for resampled datasets derived from TCGA breast tumours. The gray shade indicates the sample size, ranging from 3 (black) to 700 (light gray). (B) Standard deviation of correlation coefficients obtained from resampled datasets. The deviations shown on the vertical axis correspond to the average computed for the minimum and maximum sample size of both original datasets. Open circles: TCGA; dark circles: Leucegene. (C) Icicle representations were built and displayed in MiSTIC for 5, 10, 20 and 50 samples with either all protein-coding genes or only genes coding for transcription factors. Note the increase in peak prominence as sample sizes increase to 50 specimens.
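The resampling experiment described in the Figure 2 caption can be approximated with a short sketch. It assumes subsampling without replacement, uses synthetic data for a single gene pair rather than the Leucegene or TCGA datasets, and all names (`correlation_sd_by_sample_size`, `gene_x`, `gene_y`) are hypothetical. For each sample size it draws 50 subsets, computes the Pearson correlation in each, and reports the standard deviation of those coefficients.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_sd_by_sample_size(x, y, sizes, n_resamples=50, seed=0):
    """Standard deviation of Pearson r across random subsets of each size.

    Subsets are drawn without replacement (an assumption; the caption
    only says the datasets were obtained by resampling).
    """
    rng = random.Random(seed)
    indices = range(len(x))
    sd_by_size = {}
    for n in sizes:
        coefficients = []
        for _ in range(n_resamples):
            chosen = rng.sample(indices, n)
            coefficients.append(
                pearson([x[i] for i in chosen], [y[i] for i in chosen])
            )
        sd_by_size[n] = statistics.stdev(coefficients)
    return sd_by_size


# Synthetic expression values for two moderately correlated genes.
data_rng = random.Random(1)
gene_x = [data_rng.gauss(0, 1) for _ in range(200)]
gene_y = [0.5 * v + data_rng.gauss(0, 1) for v in gene_x]

for size, sd in correlation_sd_by_sample_size(gene_x, gene_y, [5, 10, 20, 50]).items():
    print(f"n = {size:3d}: SD of r = {sd:.3f}")
```

The spread of the correlation estimates shrinks as the sample size grows, which is consistent with the caption's note that peaks become more prominent as sample sizes increase to 50 specimens.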
Figure 3.
Imaging and comparing gene expression clusters in cancer and normal tissues with MiSTIC. (A) Icicle of the Leucegene AML NK dataset (dataset 2, Supplementary Table S1). Blue circles identify named peaks. The mitosis/cell cycle peak is labeled in red. (B) Comparative icicle of Leucegene versus TCGA AML NK datasets (dataset 2 versus 4). Dark shades indicate that similar clusters are found in both datasets while light blue indicates lack of conservation. (C) Icicle of normal TCGA breast samples (dataset 6, Supplementary Table S1). (D) Icicle of the TCGA breast tumour dataset (dataset 7, Supplementary Table S1). Green circles correspond to indicated chromosomal loci. The mitosis/cell cycle peak and peaks corresponding to the ESR1 and ERBB2 gene clusters are labeled in red. (E) Representations of the TCGA breast cancer icicle highlighted for conservation with normal breast tissue (dataset 6), lung adenocarcinoma (dataset 23) or TCGA AML NK (dataset 4). Abbreviations: CC: cell cycle/mitosis, ES: extracellular space, IFN-R: interferon response, HB: haemoglobin, HM: Hox Meis, IR: immune response, LA: lymphocyte activation, MHCII: MHC class II antigen processing and presentation, HIST: histones, PW: Prader-Willi syndrome, R: ribosome, SP: serine-protease activity, TR: transcriptional regulation, Y: Y chromosome.
Figure 4.
Identification of clusters differentially represented in breast cancer subtypes. (A) The HER2+ breast cancer dataset (#10, Supplementary Table S1) derived from the TCGA breast cancer dataset (#7, Supplementary Table S1) is shown highlighted for conservation with the complementary HER2– breast cancer dataset (#11, Supplementary Table S1). (B) The luminal B breast cancer dataset (#15, Supplementary Table S1) derived from the TCGA breast cancer dataset is shown highlighted for conservation with the luminal A cancer dataset (#14, Supplementary Table S1). (C) Minimum spanning tree representation of the ERBB2 gene cluster. Numbers show the relative position of each gene in the cluster with respect to ERBB2 (taken as origin). Colours represent the location of the genes centromeric (blue) or telomeric (red) with respect to ERBB2. (D) Variations in correlation with ERBB2 gene expression across the 17q12-q21.1 locus. Scatter plots are shown for selected genes within the ERBB2 cluster, showing the progressive drop in correlation as the distance from ERBB2 increases. In addition, correlations with RARA, a gene found at the end of the large ERBB2 amplicon, and EZH1, a gene situated well outside the amplicon, are also shown. Empty circles correspond to tumours with high expression of ERBB2. Minimum and maximum expression levels in log-transformed RPKM are given in parentheses. Gene numbering is shown as in C.
Figure 5.
Transcriptional networks in the cell cycle/mitosis (CC) cluster. (A) Compilation of genes with ChIP regions associated with at least one of the three factors. Genes associated with one factor (yellow), with two factors (orange) or with three factors (red) are highlighted (see also Supplementary Table S5). (B–D) Specificity of the enrichment of gene sets associated with the presence of ChIP regions for E2F1, MYBL2 or FOXM1 in the CC cluster. (E) Enrichment analysis for the gene set ‘Microarray up MCF7 24 h E2’ from Bourdeau et al. (23) indicates that it is enriched in the CC cluster. Up-regulated estradiol target genes are highlighted in orange in the cluster representation. (F) Selective enrichment of the gene set ‘Microarray up MCF7 24 h E2’ in the CC cluster in the breast cancer icicle.
Figure 6.
Enrichment analysis of gene signatures used for breast cancer prognosis and subtype classification in the correlation clusters of the TCGA breast cancer icicle. Enrichment is visualized for the gene sets PAM50, Oncotype DX, Mammaprint, GGI and Endopredict.
Figure 7.
Proliferative genes discriminate between intrinsic breast cancer subtypes and between blood and bone marrow leukaemia samples. (A) Samples in the breast cancer dataset (#7, Supplementary Table S1) were ordered according to expression levels of AURKA and CENPA. Tumours annotated as LumA, LumB, HER2+ and Basal-like were highlighted in different colours as shown in Supplementary Figure S8C. (B) Samples in the AML Leucegene dataset (#1, Supplementary Table S1) were ordered according to expression levels of AURKA and CENPA. Samples annotated as Blood or Bone Marrow were highlighted in different colours as shown in Supplementary Figure S10C. (C) Samples in the AML Leucegene dataset (#1, Supplementary Table S1) are presented according to expression levels of CD34 and HOXA9. Samples in favourable and adverse cytogenetics risk groups are respectively shown in orange and light blue. (D) Data from TCGA showing inclusion of promyelocytic AML (M3: large dots).
Figure 8.
FOXA1 and FOXC1 mRNA levels define two main sub-populations of tumours corresponding to basal-like versus other tumour types. (A) Minimal spanning tree for the ‘luminal’ cluster. Transcription factor-encoding genes, AR, ESR1, FOXA1, GATA3, SPDEF and XBP1 are highlighted in orange. (B) Waterfall analysis of FOXA1 most correlated and anti-correlated genes. FOXA1 and FOXC1 are highlighted in blue. (C) Breast tumours were sorted according to expression levels of FOXA1 and FOXC1, revealing two main groups of tumours. FOXA1hiFOXC1lo tumours include the lumA, lumB, HER2+ and normal-like groups, while the FOXA1loFOXC1hi group coincides with basal-like tumours.
