PLoS Comput Biol. 2012 May;8(5):e1002529.
doi: 10.1371/journal.pcbi.1002529. Epub 2012 May 31.

Exploring massive, genome scale datasets with the GenometriCorr package

Free PMC article

Alexander Favorov et al. PLoS Comput Biol. 2012.

Abstract

We have created a statistically grounded tool for determining the correlation of genomewide data with other datasets or known biological features. It is intended to guide biological exploration of high-dimensional datasets rather than to provide immediate answers. The software enables several biologically motivated approaches to these data, and here we describe the rationale and implementation for each approach. Our models and statistics are implemented in an R package that efficiently calculates the spatial correlation between two sets of genomic intervals (data and/or annotated features), for use as a metric of functional interaction. The software handles any type of pointwise or interval data and, instead of running analyses with predefined metrics, computes the significance and direction of several types of spatial association; this is intended to suggest potentially relevant relationships between the datasets.

Availability and implementation: The package, GenometriCorr, can be freely downloaded at http://genometricorr.sourceforge.net/. Installation guidelines and examples are available from the sourceforge repository. The package is pending submission to Bioconductor.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Two types of graphic output are available.
(A) A statistical summary and ECDF plots. (B) A graphical interpretation of the spatial relationships. The query features are depicted along the plot according to their distance to a reference feature; the colors indicate deviation from the expected distribution, while the overlay line indicates the density of the data at each absolute or relative distance. The data density mirrors but is independent of the log-odds colors; at small distances in the absolute distance plot the data density is higher than expected, but this represents a very small percentage of the total query points.
Figure 2. NFkappaB sites vs human RefSeq promoter start sites.
Query and reference colors as in Figure 1. (A) NFkappaB as the query gives a significant Kolmogorov-Smirnov association and anticorrelation that is visible from the graph, in absolute distances. (B) Correlation in the reverse direction suggests no significant relationship between the two classes of sites.
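A Kolmogorov-Smirnov comparison of the kind mentioned in panel (A) can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each query point is scored by its relative distance to the two flanking reference points (a value between 0 and 0.5) and that independence corresponds to a uniform distribution on that range. The function names relative_distances and ks_statistic_uniform are hypothetical and are not part of the GenometriCorr package, whose actual implementation is in R.

```python
from bisect import bisect_right


def relative_distances(query, reference):
    """Relative distance of each query point to its flanking reference points.

    Hypothetical helper: for a query q between neighbours (left, right),
    returns min(q - left, right - q) / (right - left), a value in [0, 0.5].
    Query points outside the outermost reference points are skipped.
    """
    ref = sorted(reference)
    distances = []
    for q in query:
        i = bisect_right(ref, q)  # ref[i - 1] <= q < ref[i]
        if i == 0 or i == len(ref):
            continue  # no flanking pair on one side; ignored in this sketch
        left, right = ref[i - 1], ref[i]
        distances.append(min(q - left, right - q) / (right - left))
    return distances


def ks_statistic_uniform(values, upper=0.5):
    """One-sample KS statistic against the uniform distribution on [0, upper].

    Assumption of this sketch: under independence the relative distances
    are uniform, so the statistic measures the largest gap between the
    empirical CDF and the straight-line uniform CDF.
    """
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = min(max(x / upper, 0.0), 1.0)
        # Compare the uniform CDF with the ECDF just after and just before x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d
```

A larger statistic indicates a stronger departure from the uniform expectation; converting it to a p-value and reading off the direction of the association (attraction or repulsion) is left out of this sketch.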
Figure 3. A schematic of the various tests implemented in the software package, showing when certain tests are most useful.
(A) depicts the intervals created in silico and (B) shows how the query distances are evaluated within the intervals. (C) depicts a random distribution of query versus reference intervals; here the observed and expected distances for both the absolute and relative tests are the same. In (D) we show a relationship best uncovered by the absolute distance test; useful especially for small genomes, this test determines whether the query and reference are often separated by a fixed distance. In (E), the query points are consistently far away from the reference points, so the relative distance test will be significant, while the absolute distances are not significant in this case. Interestingly, the query intervals are variable enough in size that even though the query and reference points are usually separated, the absolute distances between them vary widely in size, including some fairly small distances. (F) demonstrates the projection test, which evaluates whether pointwise data falls consistently inside or outside of a set of intervals. Finally, in (G) we see the Jaccard test, which looks for significant overlaps between datasets by evaluating the ratio of the intersection of the datasets (dark grey) to the union of the datasets (light grey). Perfect correlation will give a ratio of 1, and perfect anticorrelation will result in a ratio of zero.
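The ratio described for panel (G) and the inside/outside count described for panel (F) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the package's code: intervals are taken to be half-open (start, end) pairs on a single chromosome, and the names merge, jaccard, and projection_fraction are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Number of positions covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def intersection_length(a, b):
    """Number of positions covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Intersection over union of the two covered regions, between 0 and 1."""
    inter = intersection_length(a, b)
    union = total_length(a) + total_length(b) - inter
    return inter / union if union else 0.0


def projection_fraction(points, intervals):
    """Fraction of query points that fall inside any interval."""
    merged = merge(intervals)
    inside = sum(any(s <= p < e for s, e in merged) for p in points)
    return inside / len(points)
```

For example, with a = [(0, 10), (20, 30)] and b = [(5, 25)], the intersection covers 10 positions and the union covers 30, so jaccard(a, b) is 1/3; identical sets give 1 and disjoint sets give 0, matching the caption. For the projection measure, a natural baseline under uniform placement is the fraction of the chromosome covered by the intervals; assessing significance against that baseline is not shown here.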
Figure 4. Alu elements vs splice sites in the graphics.plot() output (A) and in the visualize() output (B).
Alu elements are consistently located at a variable but always nonzero distance from splice sites. Query and reference colors as in Figure 1.
Figure 5. A toy example of absolute distance correlation.
(A) Histograms of the observed and expected ranges of minimum distances between the reference and query. (B) GenometriCorr's simple plot for the same data. Query and reference colors as in Figure 1.
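The comparison of observed and expected minimum distances in panel (A) can be imitated with a small resampling sketch in Python. The assumptions here are the sketch's own and are not taken from the package: the expected distribution is generated by placing the same number of query points uniformly at random on a chromosome of known length, the summary statistic is the mean minimum distance, and the test is one-sided toward closeness. The names min_distances and permutation_p_value are hypothetical.

```python
import random
from bisect import bisect_left


def min_distances(query, reference):
    """Distance from each query point to its nearest reference point.

    The reference list must be non-empty.
    """
    ref = sorted(reference)
    distances = []
    for q in query:
        i = bisect_left(ref, q)
        candidates = []
        if i < len(ref):
            candidates.append(ref[i] - q)      # nearest reference at or right of q
        if i > 0:
            candidates.append(q - ref[i - 1])  # nearest reference left of q
        distances.append(min(candidates))
    return distances


def permutation_p_value(query, reference, chrom_length, n_perm=1000, seed=0):
    """Observed mean minimum distance and a one-sided resampling p-value.

    Assumption of this sketch: the null model places query points
    uniformly at random on [0, chrom_length). The p-value is the share of
    resampled sets whose mean minimum distance is at most the observed one.
    """
    rng = random.Random(seed)
    observed = sum(min_distances(query, reference)) / len(query)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.randrange(chrom_length) for _ in query]
        mean = sum(min_distances(shuffled, reference)) / len(query)
        if mean <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Collecting the resampled minimum distances instead of only their mean would give the "expected" histogram to set against the observed one.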
Figure 6. Promoter positions from highly expressed genes (as determined from mRNAseq data) and histone ChIP data recently available from the Roadmap Epigenomics Project.
(A) H3K4me3 versus highly expressed genes. (B) H3K27me3 versus highly expressed genes. Query and reference colors as in Figure 1.
Figure 7. Human genomic CpG islands from Wu et al. correlated with the positions of coding sequences in the human genome.
Query and reference colors as in Figure 1.
Figure 8. Ty1 retrotransposon insertion sites vs tRNA genes in the yeast genome.
(A) ECDF plots. (B) Graphic display. Arrows mark Ty1 insertion sites at nucleosome-occupied positions near tRNA genes. Nucleosomes are in green. The colored graph contains several regions with a high observed/expected ratio of Ty1 insertions (red colors), and the black line indicates that the density of Ty1 insertions is also high in these regions. Relative to the tRNA position, the Ty1 insertion sites are most dense inside the nucleosome-occupied regions. Query and reference colors as in Figure 1.
Figure 9. (A) The Galaxy interface to GenometriCorr. (B) The Tk interface to GenometriCorr.
Instructions for using both are found on the website.
