Comparative Study
Genome Res. 2003 Jul;13(7):1706-18.
doi: 10.1101/gr.903503.

Subsystem identification through dimensionality reduction of large-scale gene expression data

Philip M Kim et al. Genome Res. 2003 Jul.

Abstract

The availability of parallel, high-throughput biological experiments that simultaneously monitor thousands of cellular observables provides an opportunity for investigating cellular behavior in a highly quantitative manner at multiple levels of resolution. One challenge to more fully exploit new experimental advances is the need to develop algorithms to provide an analysis at each of the relevant levels of detail. Here, the data analysis method non-negative matrix factorization (NMF) has been applied to the analysis of gene array experiments. Whereas current algorithms identify relationships on the basis of large-scale similarity between expression patterns, NMF is a recently developed machine learning technique capable of recognizing similarity between subportions of the data corresponding to localized features in expression space. A large data set consisting of 300 genome-wide expression measurements of yeast was used as sample data to illustrate the performance of the new approach. Local features detected are shown to map well to functional cellular subsystems. Functional relationships predicted by the new analysis are compared with those predicted using standard approaches; validation using bioinformatic databases suggests predictions using the new approach may be up to twice as accurate as some conventional approaches.
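
The factorization named in the abstract can be illustrated with a minimal sketch. It assumes a non-negative input matrix V (genes by experiments) and uses the standard multiplicative update rules for the squared-error objective. The abstract does not state which update rule, initialization, or preprocessing the authors used, so the function name `nmf`, its parameters, and the synthetic data below are illustrative assumptions rather than the published procedure.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (genes x experiments) as W @ H.

    Uses multiplicative updates for the squared (Frobenius) error.
    W holds `rank` basis vectors (features); H holds each experiment's
    coordinates in that reduced space.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_expts = V.shape
    W = rng.random((n_genes, rank)) + eps
    H = rng.random((rank, n_expts)) + eps
    for _ in range(n_iter):
        # Element-wise updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def rms_error(V, approx):
    """Root-mean-square difference between V and its approximation."""
    return float(np.sqrt(np.mean((V - approx) ** 2)))

# Synthetic stand-in for expression data (NOT the yeast data set):
# a non-negative matrix with known rank-4 structure.
rng = np.random.default_rng(1)
V = rng.random((60, 4)) @ rng.random((4, 20))

W, H = nmf(V, rank=4)
baseline = rms_error(V, np.full_like(V, V.mean()))
print("RMS error of W @ H:", rms_error(V, W @ H))
print("RMS error of constant-mean baseline:", baseline)
```

Each column of H is the reduced representation of one experiment, and W @ H maps it back to the full gene space.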

Figures

Figure 1
The RMS error of NMF and SVD factorizations of the original data as a function of the number of dimensions in the reduced space. For comparison, SVD factorization was also carried out on a random matrix based on the data matrix. The results show that NMF is nearly as good as SVD at reproducing the original data for any dimensionality, and that near a dimensionality of about 50 the marginal increase (slope) in NMF's ability to describe the original data is similar to SVD's ability to match random (unstructured) data. Thus, an NMF dimensionality of 50 is appropriate to describe the structure in the data.
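
The comparison in Figure 1 can be sketched for the SVD half of the plot. The example below assumes that the "random matrix based on the data matrix" is obtained by shuffling the entries of the data matrix, which the caption does not confirm. It relies on the fact that the RMS error of the best rank-k approximation follows directly from the discarded singular values. The helper name `svd_rms_curve` and the synthetic data are hypothetical.

```python
import numpy as np

def svd_rms_curve(M, max_rank):
    """RMS reconstruction error of the rank-k truncated SVD for k = 1..max_rank.

    The squared error of the best rank-k approximation equals the sum of the
    squared singular values that are discarded.
    """
    s = np.linalg.svd(M, compute_uv=False)
    energy = s ** 2
    # tail[k] = energy left out when only the first k components are kept.
    tail = np.append(np.cumsum(energy[::-1])[::-1], 0.0)
    return [float(np.sqrt(tail[k] / M.size)) for k in range(1, max_rank + 1)]

# Synthetic stand-in (NOT the yeast data): rank-5 structure plus small noise.
rng = np.random.default_rng(0)
data = rng.random((80, 5)) @ rng.random((5, 30))
data += rng.normal(scale=0.01, size=data.shape)

# Assumed randomization: shuffle all entries, destroying the structure
# while keeping the same set of values.
shuffled = rng.permutation(data.ravel()).reshape(data.shape)

structured_curve = svd_rms_curve(data, 10)
random_curve = svd_rms_curve(shuffled, 10)
for k, (a, b) in enumerate(zip(structured_curve, random_curve), start=1):
    print(k, round(a, 4), round(b, 4))
```

On structured data the error falls quickly and then flattens, while on shuffled data it falls slowly throughout. Comparing the slopes of the two curves is the kind of criterion the caption uses to choose a dimensionality.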
Figure 2
Representation of gene expression data in full and NMF-reduced spaces. (Left column) The original data (log-ratio) is shown for 6 individual experiments in the space of 5346 genes. In the second column from left, the 50-dimensional NMF representation is shown. In the third column from left, the reconstruction from the NMF representation back to the original space (using W · H) is shown. (Right column) The log-ratios of the original (y-axis) are plotted against the log-ratios of the reconstruction from the NMF representation back to the original experimental space (x-axis). The data show that the NMF reduction is capable of regenerating the experiments to relatively high fidelity, and that the NMF representation of an experiment is often dominated by one or a small number of features (basis vectors).
Figure 3
Performance of different spaces at predicting functional relationships between experiments with comparison to the MIPS classification of the deleted genes. (NMF50) NMF space with 50 basis vectors; (Original Space) original gene expression space; (SVD50) SVD space with 50 eigenvectors; (MV50high) space of the 50 most varying genes; (NMF50notsparse) NMF space with 50 basis vectors without the sparsification procedure; (SVD50sparse) SVD sparsified; (k-means) predictions taken from k-means clustering with 50 clusters (3176 relationships).
Figure 4
Correlation for four illustrative pairwise functional genetic relationships. For comparison, the correlation plot of the pair of experiments in NMF space is shown at left, and in the original gene space at right.
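
The pairwise comparison described in Figures 3 and 4 can be sketched as ranking pairs of experiments by their correlation in a chosen space. The example assumes Pearson correlation between columns and a simple top-n cutoff; the captions do not specify the authors' scoring, threshold, or sparsification procedure, so the function `top_correlated_pairs` and the toy matrix are hypothetical.

```python
import numpy as np

def top_correlated_pairs(X, n_pairs):
    """Rank experiment pairs by Pearson correlation.

    X has one row per feature (genes, or NMF basis vectors) and one column
    per experiment. Returns a list of (i, j, r) tuples, highest r first.
    """
    corr = np.corrcoef(X.T)  # rows of X.T are experiments
    i_idx, j_idx = np.triu_indices(corr.shape[0], k=1)
    order = np.argsort(corr[i_idx, j_idx])[::-1][:n_pairs]
    return [(int(i_idx[o]), int(j_idx[o]), float(corr[i_idx[o], j_idx[o]]))
            for o in order]

# Toy reduced-space matrix (6 features x 5 experiments), NOT real data.
# Experiments 0 and 1 are both dominated by feature 2; the others are
# each dominated by a different feature.
rng = np.random.default_rng(0)
H = 0.05 * rng.random((6, 5))
H[2, 0] += 1.0
H[2, 1] += 0.9
H[0, 2] += 1.0
H[4, 3] += 1.0
H[5, 4] += 1.0

pairs = top_correlated_pairs(H, 3)
for i, j, r in pairs:
    print(f"experiments {i} and {j}: r = {r:.3f}")
```

The same function can be applied to the original genes-by-experiments matrix, so the rankings from the two spaces can be compared against an external classification.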

WEB SITE REFERENCES

    1. http://mips.gsf.de; Munich Information Center for Protein Sequences.
    2. http://www.incyte.com/; Yeast Proteome Database.
    3. http://www.rii.com/register/cell2000102Hughes/EULA.htm; Source Data at Rosetta Inpharmatics.
