Nucleic Acids Res. 2011 Sep 1;39(17):7380-9.
doi: 10.1093/nar/gkr462. Epub 2011 Jun 19.

An Intuitive Graphical Visualization Technique for the Interrogation of Transcriptome Data

Free PMC article

Natascha Bushati et al. Nucleic Acids Res.

Abstract

The complexity of gene expression data generated from microarrays and high-throughput sequencing makes their analysis challenging. One goal of these analyses is to define sets of co-regulated genes and identify patterns of gene expression. To date, however, there is a lack of easily implemented methods that allow an investigator to visualize and interact with the data in an intuitive and flexible manner. Here, we show that combining a nonlinear dimensionality reduction method, t-distributed Stochastic Neighbor Embedding (t-SNE), with a novel visualization technique provides a graphical mapping that allows the intuitive investigation of transcriptome data. This approach performs better than commonly used methods, offering insight into underlying patterns of gene expression at both global and local scales and identifying clusters of similarly expressed genes. A free MATLAB graphical user interface for producing t-SNE maps and nearest neighbour plots of genomic data sets is available at www.nimr.mrc.ac.uk/research/james-briscoe/visgenex.

Figures

Figure 1.
t-SNE mappings and PCA of two high-dimensional gene expression data sets. (a and b) t-SNE maps of (a) 2148 probe sets identified as differentially expressed between six stages of human embryogenesis (10) and (b) 3656 probe sets with periodic behaviour over 36 cycles in the yeast metabolic cycle described by Tu et al. (12). Selected groups of neighbouring data points are highlighted, and the expression behaviour (plotted as z-scores) of the selected genes over all conditions is shown in the corresponding colours. S9–S14: Carnegie stages 9–14; T0–T36: time points 0–36. (c and d) Plots of the values of the first and second principal components of the same probe sets used to produce the t-SNE maps in (a and b).
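The expression profiles in Figure 1 are plotted as z-scores. As a minimal sketch of that transformation, assuming each probe set is standardized across its own conditions and that the population standard deviation is used (the caption does not say which), the function `zscore_profile` and the example values below are hypothetical:

```python
from statistics import mean, pstdev


def zscore_profile(values):
    """Standardize one probe set's expression values across conditions."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; sample SD is also common
    if sigma == 0:
        # A flat profile carries no pattern; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values for one probe set at six stages (S9 to S14).
profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = zscore_profile(profile)
print([round(v, 3) for v in z])
```

After this step every profile has mean 0 and unit spread, so profiles are compared by shape rather than by absolute expression level.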
Figure 2.
t-SNE mappings have a high degree of local validity. We compared the quality of t-SNE mappings and projections of the first two PCs for each of the five data sets (a–e). Here H-space denotes the original high-dimensional space and V-space the two-dimensional visualization. In each case three measures of quality were used:
(i) The distance between each data point and all other data points was determined and a rank ordering of the neighbours of each data point constructed. The median rank ordering of the neighbours in V-space was compared to the rank orderings in the original H-space (Red, PC projection; Blue, t-SNE mapping). Conversely, the median rank of the neighbours of data points in H-space was compared to the rank of neighbours in V-space (Magenta, PC projection; Cyan, t-SNE mapping). The closest 100 neighbours are shown in the figure. An optimal method would produce a median rank of neighbours equal to the original rank; this is indicated by the dashed black lines along the main diagonal of the graphs. For each data set, the t-SNE mappings performed better than PCA.
(ii) Histogram of co-ranking matrices comparing the rank of neighbours in PC1-PC2 projections (left) or t-SNE maps (right) with the rank of neighbours in H-space (24). The neighbours of every H-point, ranked according to Euclidean distance in H-space, were compared to the distance-ranked neighbours of the corresponding V-point. The joint histogram of these co-ranking matrices was plotted to display the number of neighbours of specific ranks in V-space as a function of the original H-space neighbour ranks. The standard ‘jet’ colour map (MATLAB) was used to indicate the log base 10 of the number of co-ranked neighbours: red indicates high numbers, blue low numbers, and the first 30 ranks are displayed. Optimal performance would produce neighbours in the same rank ordering in H-space and V-space. The higher counts along the main diagonal show that t-SNE produces more equally ranked neighbours than PCA.
(iii) Plots of the q-score, as defined by Lee and Verleysen (24), for PCA (red) and t-SNE (blue) mappings of each data set. The co-ranking matrix (see above) was used to calculate the error in neighbour ranking in V-space compared to H-space, and this error was transformed into a measure of the quality of the projections. This measure was plotted as a cumulative score over neighbour ranks. In this plot, higher values of Q for low-ranked neighbours (points close together in H-space) indicate better local validity in V-space. For each data set, t-SNE outperformed PCA.
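The co-ranking comparison in (ii) and (iii) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' MATLAB code: it assumes the q-score corresponds to the quantity Q_NX(K) = (1/(K N)) Σ_{k≤K} Σ_{l≤K} Q_kl from Lee and Verleysen, where Q_kl counts point pairs with rank k in the high-dimensional space and rank l in the map. The function names and the toy coordinates are hypothetical.

```python
import math


def rank_matrix(points):
    """ranks[i][j] = rank of point j among the neighbours of point i (1 = closest)."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other points by Euclidean distance; ties are broken by index.
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks


def coranking_matrix(high, low):
    """q[k-1][l-1] = number of pairs with rank k in `high` and rank l in `low`."""
    n = len(high)
    rank_high, rank_low = rank_matrix(high), rank_matrix(low)
    q = [[0] * (n - 1) for _ in range(n - 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[rank_high[i][j] - 1][rank_low[i][j] - 1] += 1
    return q


def q_nx(q, k):
    """Fraction of each point's k nearest neighbours retained within rank k in the map."""
    n = len(q) + 1
    kept = sum(q[a][b] for a in range(k) for b in range(k))
    return kept / (k * n)


# Toy data: four points in 3D and two candidate 2D maps (hypothetical values).
high = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (7, 0, 0)]
faithful_map = [(0, 0), (1, 0), (3, 0), (7, 0)]   # preserves every neighbour rank
scrambled_map = [(0, 0), (7, 0), (1, 0), (3, 0)]  # swaps points around

q_faithful = coranking_matrix(high, faithful_map)
q_scrambled = coranking_matrix(high, scrambled_map)
print(q_nx(q_faithful, 1), q_nx(q_scrambled, 1))
```

A map that preserves all rank orderings puts every count on the main diagonal of the co-ranking matrix, so the score is 1 at every K; a poorer map scores lower at small K, which is the region the caption associates with local validity.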
Figure 3.
Data set 1: t-SNE mappings and nearest neighbour plots provide a means to evaluate and refine clustering of co-expressed genes. (a) Nearest neighbour plot of the t-SNE mapping in Figure 1a. Each data point in the t-SNE map was connected to its two nearest neighbours in high-dimensional (6D) space and the connectors coloured according to the distance between these data points in high-dimensional space. Red indicates short, and blue long, distances in the higher dimensional space. Thus short red lines indicate faithful projection of distances. (b and c) Overlay of clusters of putatively co-regulated genes onto the t-SNE map obtained from the human embryogenesis data set. Data points are coloured according to cluster membership. (b) Overlay of clusters 1–6 produced using SOMs from the original study (10). (c) Overlay of 10 clusters produced by re-analysis of the original data using hierarchical clustering (left panel) or k-means clustering (right panel), respectively.
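The construction of the nearest neighbour plot in (a) can be sketched as follows. This is a minimal illustration, assuming Euclidean distance in the high-dimensional space; the function name `nearest_neighbour_edges` and the toy profiles are hypothetical, and drawing the coloured connectors on the map is left out.

```python
import math


def nearest_neighbour_edges(high, k=2):
    """Return (i, j, distance) for each point i and its k nearest neighbours j,
    with distances measured in the high-dimensional space."""
    edges = []
    for i, p in enumerate(high):
        # Distances from point i to every other point, closest first.
        dists = sorted((math.dist(p, other), j)
                       for j, other in enumerate(high) if j != i)
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges


# Toy 6D expression profiles (hypothetical values), one tuple per probe set.
high = [
    (0, 0, 0, 0, 0, 0),
    (1, 0, 0, 0, 0, 0),
    (0, 2, 0, 0, 0, 0),
    (10, 10, 0, 0, 0, 0),
]
edges = nearest_neighbour_edges(high, k=2)
for i, j, d in edges:
    print(i, j, round(d, 2))
```

In the plot, each edge would be drawn between the map coordinates of points i and j and coloured by d, so a connector that is short on the map and red (small d) marks a pair whose closeness was preserved, while a long red connector marks true neighbours that the map has pulled apart.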
Figure 4.
Data set 2: t-SNE mappings and nearest neighbour plots provide a means to evaluate and refine clustering of co-expressed genes. (a) Nearest neighbour plot of the t-SNE mapping in Figure 1b. Each data point in the t-SNE map was connected to its two nearest neighbours in high-dimensional (36D) space and the connectors coloured according to the distance between these data points in high-dimensional space. Red indicates short, and blue long, distances in the higher dimensional space. Thus short red lines indicate faithful projection of distances. (b and c) t-SNE map overlays of three clusters representing the main periodic behaviours in the yeast metabolic cycle as described in Tu et al. (12). Data points are coloured according to cluster membership. (b) Overlay onto the t-SNE map produced from the probe set identified as periodic in the original study. (c) Overlay onto t-SNE maps from the original data set filtered using F-score cut-offs. F-scores were calculated by considering corresponding time points from consecutive cycles as biological replicates.
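The F-score filter in (c) can be illustrated with a short sketch. The caption does not define the F-score, so this assumes it is the one-way ANOVA F statistic, with one group per position within the cycle and one replicate per cycle. The function names, the cycle length of four, and the example series are hypothetical.

```python
from statistics import mean


def group_by_cycle_position(series, points_per_cycle):
    """Group values so that corresponding time points of consecutive cycles
    end up together and can be treated as replicates."""
    return [series[pos::points_per_cycle] for pos in range(points_per_cycle)]


def f_score(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf")  # replicates agree perfectly
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Three cycles of four time points each (hypothetical values).
periodic = [1.0, 5.0, 9.0, 5.0, 1.2, 5.1, 8.8, 4.9, 0.9, 4.8, 9.1, 5.2]
irregular = [5, 1, 9, 5, 9, 5, 1, 5, 1, 9, 5, 5]

f_periodic = f_score(group_by_cycle_position(periodic, 4))
f_irregular = f_score(group_by_cycle_position(irregular, 4))
print(round(f_periodic, 1), f_irregular)
```

A probe set that repeats the same pattern in every cycle has small within-group variance and therefore a large F-score, so raising the cut-off keeps progressively more reproducibly periodic probe sets.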

References

    1. Gehlenborg N, O'Donoghue SI, Baliga NS, Goesmann A, Hibbs MA, Kitano H, Kohlbacher O, Neuweger H, Schneider R, Tenenbaum D, et al. Visualization of omics data for systems biology. Nat. Methods. 2010;7:S56–68.
    2. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–14868.
    3. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat. Genet. 1999;22:281–285.
    4. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA. 1999;96:2907–2912.
    5. Hotelling H. Analysis of complex statistical variables into principal components. J. Educ. Psychol. 1933;24:417–441.
