Statistical significance of variables driving systematic variation in high-dimensional data
- PMID: 25336500
- PMCID: PMC4325543
- DOI: 10.1093/bioinformatics/btu674
Abstract
Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting.
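The over-fitting problem described above can be illustrated with a small simulation. This is a minimal sketch, not taken from the article: it assumes pure Gaussian noise data, uses NumPy, and measures association by squared Pearson correlation; the function name `mean_squared_correlation` and the dimensions `m` and `n` are hypothetical choices made for illustration. Because the top PC is built from the rows themselves, the rows tend to correlate with it more strongly than with a vector generated independently of the data, even though no latent variable exists.

```python
import numpy as np

def mean_squared_correlation(Y, v):
    """Average squared Pearson correlation between each row of Y and vector v."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    vc = v - v.mean()
    num = (Yc @ vc) ** 2
    den = (Yc ** 2).sum(axis=1) * (vc ** 2).sum()
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
m, n = 200, 20                      # m variables (rows), n observations (columns)
Y = rng.standard_normal((m, n))     # pure noise: no latent variable exists

# Top PC of the row-centered data (first right singular vector).
Yc = Y - Y.mean(axis=1, keepdims=True)
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
pc1 = Vt[0]

# A vector generated independently of Y, for comparison.
independent = rng.standard_normal(n)

r2_pc = mean_squared_correlation(Y, pc1)
r2_indep = mean_squared_correlation(Y, independent)
print(f"mean r^2 with PC1: {r2_pc:.3f}; with independent vector: {r2_indep:.3f}")
```

A conventional test that treats the PC as if it were an externally measured covariate ignores this built-in correlation, which is why its p-values would tend to be too small.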
Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
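The abstract does not spell out the resampling procedure, so the following is only a rough sketch under stated assumptions, not the authors' implementation or the API of the jackstraw R package. It assumes the scheme works by permuting a small number `s` of rows to break their link with the latent structure, recomputing the PCs on the perturbed matrix, and collecting the association statistics of those permuted rows as an empirical null. The function names (`top_pcs`, `f_stats`, `jackstraw_pvalues`), the F-statistic, and all parameter values are illustrative choices.

```python
import numpy as np

def top_pcs(Y, r):
    """Top r PCs (right singular vectors) of the row-centered matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r]                                    # shape (r, n)

def f_stats(Y, pcs):
    """Per-row F-statistic: intercept-only model vs. intercept plus PCs."""
    n, r = Y.shape[1], pcs.shape[0]
    Yc = Y - Y.mean(axis=1, keepdims=True)
    rss0 = (Yc ** 2).sum(axis=1)
    fitted = (Yc @ pcs.T) @ pcs      # PCs are orthonormal and mean-zero
    rss1 = ((Yc - fitted) ** 2).sum(axis=1)
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))

def jackstraw_pvalues(Y, r=1, s=10, B=50, seed=0):
    """Empirical p-values for association of each row with the top r PCs."""
    rng = np.random.default_rng(seed)
    m = Y.shape[0]
    observed = f_stats(Y, top_pcs(Y, r))
    null = np.empty((B, s))
    for b in range(B):
        Ystar = Y.copy()
        rows = rng.choice(m, size=s, replace=False)
        for i in rows:
            Ystar[i] = rng.permutation(Y[i])         # synthetic null row
        # PCs are re-estimated with the null rows included, so the null
        # statistics carry the same over-fitting as the observed ones.
        null[b] = f_stats(Ystar[rows], top_pcs(Ystar, r))
    null = np.sort(null.ravel())
    exceed = null.size - np.searchsorted(null, observed, side="left")
    return exceed / null.size        # fraction of null stats >= observed

# Simulated example: only the first m_signal rows depend on a latent z.
rng = np.random.default_rng(1)
m, n, m_signal = 300, 30, 30
z = rng.standard_normal(n)           # hypothetical latent variable
Y = rng.standard_normal((m, n))
Y[:m_signal] += 2.0 * z
pvals = jackstraw_pvalues(Y, r=1, s=10, B=50)
print(np.median(pvals[:m_signal]), np.median(pvals[m_signal:]))
```

In this simulation the rows driven by `z` should receive small p-values, while the remaining rows should receive p-values spread across the unit interval. Keeping `s` small relative to the number of rows is intended to leave the estimated PCs close to those of the original data.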
Availability and implementation: An R software package, called jackstraw, is available on CRAN.
Contact: jstorey@princeton.edu.
© The Author 2014. Published by Oxford University Press.