Nat Commun. 2021 May 11;12(1):2669.
doi: 10.1038/s41467-021-22862-1.

Overcoming false-positive gene-category enrichment in the analysis of spatially resolved transcriptomic brain atlas data

Ben D Fulcher et al. Nat Commun. 2021.

Abstract

Transcriptomic atlases have improved our understanding of the correlations between gene-expression patterns and spatially varying properties of brain structure and function. Gene-category enrichment analysis (GCEA) is a common method to identify functional gene categories that drive these associations, using gene-to-category annotation systems like the Gene Ontology (GO). Here, we show that standard GCEA methodology, when applied to spatial transcriptomic data, is affected by substantial false-positive bias: GO categories display a more than 500-fold average inflation of false-positive associations with random neural phenotypes in mouse and human. The estimated false-positive rate of a GO category is associated with its rate of being reported as significantly enriched in the literature, suggesting that published reports are affected by this false-positive bias. We show that within-category gene-gene coexpression and spatial autocorrelation are key drivers of the false-positive bias and introduce flexible ensemble-based null models that can account for these effects, made available as a software toolbox.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Pipeline for applying gene category enrichment analysis (GCEA) to brain-wide expression atlas data.
A Given a phenotype map, we first compute the spatial correlation coefficient between that map and each gene. These gene scores are then agglomerated at the level of categories using an annotation system like the Gene Ontology (GO). For continuous scores, agglomeration is typically performed as the mean score of genes annotated to a category. B Statistical significance of a GO category is assessed relative to the random-gene null, which estimates a null distribution for each GO category by annotating genes to GO categories at random. C For every GO category, a p value is estimated using a permutation test, by comparing the GO category score obtained from the real data to the null distribution.
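The pipeline in this caption can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' toolbox: the function names, the dictionary layout of `expression`, the one-sided test, and the `+1` correction in the permutation p value are all assumptions made for the example.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def random_gene_null_pvalue(expression, phenotype, category, n_null=1000, seed=0):
    """Score one category and test it against the random-gene null.

    expression: dict mapping gene name -> expression values across brain regions
    phenotype:  phenotype values across the same regions
    category:   list of gene names annotated to the category
    """
    rng = random.Random(seed)
    # Step A: one score per gene (spatial correlation with the phenotype),
    # then agglomerate as the mean score of the category's genes.
    gene_scores = {g: pearson(v, phenotype) for g, v in expression.items()}
    observed = mean(gene_scores[g] for g in category)
    # Step B: null distribution from same-sized categories of randomly chosen genes.
    all_scores = list(gene_scores.values())
    null = [mean(rng.sample(all_scores, len(category))) for _ in range(n_null)]
    # Step C: one-sided permutation p value (assumed: larger score = more extreme).
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_null)
    return observed, p_value
```

Because the null categories are drawn from genes at random, any correlation structure among the real category's genes is absent from the null distribution, which is the property the later figures examine.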
Fig. 2. Some GO categories have far higher false-positive rates under randomized spatial phenotypes than statistical expectation, and these GO categories are more likely to be reported as significant in published studies.
A A schematic of three GCEA analyses involving correlations between (i) “reference”—an ensemble of random phenotypes and randomized gene-expression data (green), (ii) “SBP-random”—an ensemble of random phenotypes and real gene-expression data (red), and (iii) “SBP-spatial”—an ensemble of spatially autocorrelated phenotypes and real gene-expression data (blue). For the human cortex, examples of spatial maps in each ensemble are plotted; relative to the SBP-random maps that have no spatial correlation structure, the SBP-spatial maps are more likely to have similar values in nearby locations. B Distributions of the category false-positive rate (CFPR) across all GO categories are shown as violin plots in mouse. Across an ensemble of 10,000 SBP-random or SBP-spatial maps, the CFPR is computed for each GO category as the proportion of phenotypes for which that GO category was found to be significant. Results are shown for the three analyses depicted in A: (i) “reference” (green); (ii) “SBP-random” (red); and (iii) “SBP-spatial” (blue). Note the logarithmic vertical scale (and therefore exclusion of GO categories with CFPR = 0). C The proportion of literature-reported GO categories increases with the CFPR estimated from random phenotypes. GO categories were labeled from a literature survey of GCEA analyses using atlas-based transcriptional data in human and mouse (see Supplementary Information for survey details). Across eight equiprobable bins of CFPR (i.e., each bin contains the same number of GO categories), we plot the proportion of all literature-reported GO categories that are contained in that bin. Results are shown for the SBP-random (red) and SBP-spatial (blue) ensembles in mouse (dotted) and human (solid). The position of each bin is shown as the mean of its extremities.
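The CFPR definition in panel B reduces to a simple proportion. The sketch below assumes that a separate enrichment analysis has already produced, for each phenotype in the ensemble, the set of categories it found significant; the function name and input layout are hypothetical.

```python
from collections import Counter


def category_false_positive_rates(significant_sets, categories):
    """CFPR per category: fraction of ensemble phenotypes for which it was significant.

    significant_sets: one set of significant category names per random phenotype
    categories:       all category names tested (so that CFPR = 0 is reported too)
    """
    n_phenotypes = len(significant_sets)
    counts = Counter(c for sig in significant_sets for c in set(sig))
    return {c: counts[c] / n_phenotypes for c in categories}
```

For example, a category that is significant for 3 of 4 random phenotypes gets a CFPR of 0.75, and one that is never significant gets 0, which is why it drops off a logarithmic axis.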
Fig. 3. Category false-positive significance rates (CFPRs) vary with within-category gene–gene coexpression and spatial autocorrelation in mouse and human.
A CFPR (%) computed from the SBP-random ensemble of random spatial maps increases with a measure of mean within-category coexpression, ⟨r⟩, across ten equiprobable bins in mouse (blue) and human (green). The extent of each bin is displayed as a horizontal line. B The percentage of GO categories that exhibited an increase in CFPR when using the SBP-spatial ensemble relative to the SBP-random ensemble, across ten equiprobable bins of the spatial autocorrelation score, $R^2_{\mathrm{exp}}$. This score captures the goodness of fit of each GO category’s correlated gene expression to an exponential function with distance. More spatially autocorrelated GO categories are more likely to exhibit an increase in CFPR for spatially autocorrelated phenotypes (the SBP-spatial ensemble). The average value across all GO categories is shown as a horizontal dotted line.
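One plausible way to compute a mean within-category coexpression score is the average pairwise correlation among a category's genes. The caption does not give the exact definition of ⟨r⟩, so the sketch below is an assumption: it uses Pearson correlation across regions, and the function names are illustrative.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_within_category_coexpression(expression, category):
    """Average Pearson correlation over all distinct gene pairs in a category.

    expression: dict mapping gene name -> expression values across brain regions
    category:   list of at least two gene names annotated to the category
    """
    return mean(
        pearson(expression[a], expression[b]) for a, b in combinations(category, 2)
    )
```

A category whose genes rise and fall together across regions scores close to 1, while a category of unrelated genes scores near 0.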
Fig. 4. The statistical significance of a GO category can be quantified relative to conventional random-gene nulls, or ensemble-based null models introduced here.
For a given spatial brain phenotype (SBP) of interest, we depict the process through which null samples are generated for estimating statistical significance for GCEA across three different null models. A The conventional random-gene null tests whether the observed result is more extreme than if genes were assigned to GO categories at random (similar to the illustrated shuffling of gene identities). As this destroys within-category gene–gene correlation structure, it leads to high category false-positive rates for random phenotypes. An alternative is to compute null distributions for each category based on an ensemble of null phenotypes. B The SBP-random null tests whether the observed result is more extreme than if the phenotype of interest was a random spatial map. C The SBP-spatial null tests whether the observed result is more extreme than if the phenotype of interest was a random spatially autocorrelated map.
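The SBP-random null in panel B can be illustrated with a short sketch. Instead of shuffling genes, it keeps the real expression data and gene-to-category annotations and randomizes the phenotype. The function names, the use of independent Gaussian values for the random maps, and the one-sided p value are assumptions for this example; an SBP-spatial variant would swap in a generator of spatially autocorrelated maps, which is not shown here.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def sbp_random_null_pvalue(expression, phenotype, category, n_null=1000, seed=0):
    """Test one category against an ensemble of random phenotype maps.

    expression: dict mapping gene name -> expression values across brain regions
    phenotype:  phenotype values across the same regions
    category:   list of gene names annotated to the category
    """
    rng = random.Random(seed)
    n_regions = len(phenotype)

    def category_score(pheno):
        # Mean gene-phenotype correlation over the real category's genes.
        return mean(pearson(expression[g], pheno) for g in category)

    observed = category_score(phenotype)
    # Null ensemble: spatially unstructured maps (assumed i.i.d. Gaussian values).
    # Genes and annotations are untouched, so within-category coexpression
    # is present in the null distribution as well as in the observed score.
    null = [
        category_score([rng.gauss(0.0, 1.0) for _ in range(n_regions)])
        for _ in range(n_null)
    ]
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_null)
    return observed, p_value
```

The design difference from the random-gene null is where the randomness enters: each category gets its own null distribution, shaped by the correlation structure of its own genes.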
Fig. 5. GO category enrichment results depend strongly on the null ensemble.
A Across the range of structural connectome nodal metrics (mouse and human) and cell density phenotypes (mouse), we show the nine phenotypes that individually exhibited categories with significant enrichment according to at least one of the null models. In all but one case, enrichment under the random-gene null is not significant under either of the random-phenotype nulls. B Picking an example enrichment analysis (oligodendrocyte cell density, which has nine significant categories under the random-gene null), we plot the variation in estimated p values (uncorrected) across the three null models (estimated from a Gaussian fit to the null distribution as $p_Z$). The corrected significance threshold, $q_{\mathrm{FDR}} = 0.05$, for the random-gene null is shown as a dashed red line; bars to the right of this line are considered significant at a false discovery rate of 0.05.
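The two statistical steps mentioned in panel B can be sketched as follows. The Gaussian-fit p value follows the caption's description; the one-sided upper tail and the Benjamini-Hochberg procedure for false discovery rate control are assumptions, since the caption does not name the correction method. Function names are illustrative.

```python
from statistics import NormalDist, mean, stdev


def gaussian_fit_pvalue(observed, null_scores):
    """Upper-tail p value from a Gaussian fitted to the null category scores."""
    z = (observed - mean(null_scores)) / stdev(null_scores)
    # P(Z >= z), evaluated as cdf(-z) to keep precision for small p values.
    return NormalDist().cdf(-z)


def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: which p values are significant at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k such that p_(k) <= q * k / m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            significant[i] = True
    return significant
```

A Gaussian fit lets p values fall below the resolution of a finite permutation test (1 divided by the number of null samples), at the cost of assuming the null distribution is approximately normal.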
