Bioinformatics. 2015 Feb 15;31(4):545-54.
doi: 10.1093/bioinformatics/btu674. Epub 2014 Oct 21.

Statistical significance of variables driving systematic variation in high-dimensional data

Neo Christopher Chung et al. Bioinformatics. 2015.

Abstract

Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting.

Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses.
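The resampling idea behind the jackstraw, as outlined in the Fig. 3 caption, can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' R implementation: the function names (`top_pcs`, `f_stats`, `jackstraw`), the default parameter values, the simulated data, and the add-one correction in the empirical P value are all assumptions made for the example.

```python
import numpy as np


def top_pcs(Y, r):
    """Return the top r right singular vectors (n x r) of row-centered Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Vt[:r].T


def f_stats(Y, V):
    """Per-row F-statistic: row ~ intercept + columns of V vs. intercept only."""
    n, r = V.shape
    X = np.column_stack([np.ones(n), V])
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    rss_full = ((Y.T - X @ coef) ** 2).sum(axis=0)
    rss_null = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return ((rss_null - rss_full) / r) / (rss_full / (n - r - 1))


def jackstraw(Y, r=1, s=10, B=50, seed=0):
    """Hypothetical sketch: empirical P values for row-vs-PC associations."""
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))
    null = []
    for _ in range(B):
        Ystar = Y.copy()
        idx = rng.choice(m, size=s, replace=False)
        for i in idx:
            # Permute each chosen row independently to break its association.
            Ystar[i] = rng.permutation(Y[i])
        # Recompute the PCs from Y* and keep statistics of the s synthetic nulls.
        null.append(f_stats(Ystar[idx], top_pcs(Ystar, r)))
    null = np.sort(np.concatenate(null))
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    # Add-one correction (an assumption here) keeps P values strictly positive.
    return observed, (n_ge + 1) / (null.size + 1)


# Simulated example: 20 rows driven by one latent variable, 180 pure-noise rows.
rng = np.random.default_rng(1)
z = rng.normal(size=20)
Y = rng.normal(size=(200, 20))
Y[:20] += 2.0 * z
fstat, pvals = jackstraw(Y, r=1, s=20, B=50)
```

Because only a small number of rows are permuted in each iteration, the recomputed PCs stay close to the originals, so the synthetic null statistics carry the same over-fitting that affects the observed ones.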

Availability and implementation: An R software package, called jackstraw, is available in CRAN.

Contact: jstorey@princeton.edu.

Figures

Fig. 1.
Illustration of systematic variation in genomic data due to latent variables. Complex biological variables, such as clinical subtypes and cell-cycle regulation, may be difficult to define, measure, or model. Instead, we can characterize the manifestation of latent variables, L(z), directly from high-dimensional genomic data using PCA and related methods. The proposed method calculates the statistical significance of associations between variables in Y and estimates of L, while accounting for over-fitting due to the fact that L must be estimated from Y.
Fig. 2.
Identification of yeast genes associated with cell-cycle regulation. (a) The top two PCs of gene expression measured over time in a population of yeast whose cell cycles have been synchronized by elutriation; these PCs appear to capture cell-cycle regulation patterns (Spellman et al., 1998). The dashed lines are natural cubic smoothing splines (with 5 degrees of freedom) fit to each PC. (b) The percent variance explained by PCs shows that the top two PCs capture 48% of the total variance in the data. (c) Hierarchical clustering of expression levels of genes significantly associated with the top two PCs at an FDR threshold of 1%, where rows are genes and columns are time points. Hierarchical clustering was applied to this subset of 2998 genes.
Fig. 3.
A schematic of the general steps of the proposed algorithm to calculate the statistical significance of associations between variables (rows in Y) and their top r PCs (V_r^T). By independently permuting a small number (s) of variables and recalculating the PCs, we generate tractable “synthetic” null variables while preserving the overall systematic variation. Association statistics between the s synthetic null variables in Y* and V*_r^T form the empirical null distribution, automatically taking into account the over-fitting intrinsic to testing for associations between a set of observed variables and their PCs.
Fig. 4.
Sixteen simulation scenarios generated by combining four design factors. To assess the statistical accuracy of the conventional F-test and the proposed method, we simulated 500 independent studies for each scenario, and assessed statistical accuracy according to the “joint null criterion” (Leek and Storey, 2011). For the b_i ∈ {-1, 1} scenarios, non-null coefficients were set to either -1 or 1 with a probability of 0.5. For a given simulation study, a valid statistical testing procedure must yield a set of null P values that are jointly distributed Uniform(0,1). We use a KS test to identify deviations from the Uniform(0,1) distribution. Supplementary Material, Figure S3 provides a detailed overview of the evaluation pipeline.
Fig. 5.
Evaluation of significance measures of associations between variables and their PCs by comparing true null P values and the Uniform(0,1) distribution. (a) The conventional F-test results in anti-conservative P values, as demonstrated by null P values being skewed towards 0. (b) The proposed method produces null P values distributed Uniform(0,1). The dashed line shows the Uniform(0,1) density function.
Fig. 6.
QQ-plots of double KS test P values from 16 simulation scenarios versus the Uniform(0,1) distribution. For each of 500 independent studies per scenario, we tested for deviation of null P values from Uniform(0,1), resulting in 500 KS test P values for each scenario. An individual point in the QQ-plot represents a double KS test P value for one scenario, comparing its 500 KS test P values to Uniform(0,1). In the left panel, the systematic downward displacement of the 16 black points indicates an anti-conservative bias of the conventional F-test. In contrast, the proposed method produces null P values that are not anti-conservative. In the right panel, the 16 points fall below the diagonal red line if the joint null distribution deviates from the Uniform(0,1) distribution. The proposed method adjusts for over-fitting of PCA and produces accurate estimates of association significance.
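The uniformity checks described in the captions of Figs 4 to 6 can be illustrated with a small one-sample Kolmogorov-Smirnov test against Uniform(0,1). The sketch below is an assumption-laden example, not the paper's evaluation pipeline: `ks_uniform` is a hypothetical helper, the P value uses the asymptotic Kolmogorov series, and the simulated P values stand in for the output of a real testing procedure.

```python
import numpy as np


def ks_uniform(p):
    """One-sample KS test of p against Uniform(0,1): (statistic, asymptotic P)."""
    p = np.sort(np.asarray(p, dtype=float))
    n = p.size
    i = np.arange(1, n + 1)
    # Largest gap between the empirical CDF and the Uniform(0,1) CDF F(x) = x.
    d = max(np.max(i / n - p), np.max(p - (i - 1) / n))
    lam = np.sqrt(n) * d
    if lam < 0.3:
        # The series converges poorly here; the true value is essentially 1.
        return d, 1.0
    k = np.arange(1, 101)
    q = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam**2))
    return d, float(np.clip(q, 0.0, 1.0))


rng = np.random.default_rng(0)

# Null P values from a valid procedure versus anti-conservative ones
# (squaring skews the values towards 0, as in Fig. 5a).
valid = rng.uniform(size=1000)
anti_conservative = rng.uniform(size=1000) ** 2
d_valid, p_valid = ks_uniform(valid)
d_anti, p_anti = ks_uniform(anti_conservative)

# "Double KS" idea: one KS P value per simulated study, then a second KS test
# on that collection of P values.
study_pvals = [ks_uniform(rng.uniform(size=1000))[1] for _ in range(200)]
d_double, p_double = ks_uniform(study_pvals)
```

A small KS P value for a set of true null P values signals a departure from Uniform(0,1); applying the test a second time across studies summarizes whether such departures are systematic within a scenario.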
