Genome Biol. 2020 Jan 22;21(1):17.
doi: 10.1186/s13059-019-1924-8.

A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods

Free PMC article

Jill E Moore et al. Genome Biol. 2020.

Abstract

Background: Many genome-wide collections of candidate cis-regulatory elements (cCREs) have been defined using genomic and epigenomic data, but it remains a major challenge to connect these elements to their target genes.

Results: To facilitate the development of computational methods for predicting target genes, we develop a Benchmark of candidate Enhancer-Gene Interactions (BENGI) by integrating the recently developed Registry of cCREs with experimentally derived genomic interactions. We use BENGI to test several published computational methods for linking enhancers with genes, including signal correlation and the TargetFinder and PEP supervised learning methods. We find that TargetFinder is the best-performing method, but when trained and tested in the same cell type it is only modestly better than a baseline distance method on most benchmark datasets. When applied across cell types, TargetFinder often does not outperform the distance method.

Conclusions: Our results suggest that current computational methods need to be improved and that BENGI presents a useful framework for method development and testing.

Keywords: Benchmark; Enhancer; Genomic interactions; Machine learning; Target gene; Transcriptional regulation.

Conflict of interest statement

Z. Weng is a cofounder of Rgenta Therapeutics and she serves on its scientific advisory board.

Figures

Fig. 1
A benchmark of candidate enhancer-gene interactions (BENGI). a Experimental datasets used to curate BENGI interactions categorized by 3D chromatin interactions, genetic interactions, and CRISPR/Cas9 perturbations. b Methods of generating cCRE-gene pairs (dashed straight lines in green, shaded green, or red) from experimentally determined interactions or perturbation links (dashed, shaded arcs in red, pink, or gold). Each cCRE-gene pair derived from 3D chromatin interactions (top panel) has a cCRE-ELS (yellow box) intersecting one anchor of a link, and the pair is classified depending on the other anchor of the link: for a positive pair (dashed green line), the other anchor overlaps one or more TSSs of just one gene; for an ambiguous pair (dashed line with gray shading), the other anchor overlaps the TSSs of multiple genes; for a negative pair (dashed red line), the other anchor does not overlap with a TSS. Each cCRE-gene pair derived from genetic interactions or perturbation links (middle and bottom panels) has a cCRE-ELS (yellow box) intersecting an eQTL SNP or a CRISPR-targeted region, and the pair is classified as positive (dashed green line) if the gene is an eQTL or crisprQTL gene, while all the pairs that this cCRE forms with non-eQTL genes that have a TSS within the distance cutoff are considered negative pairs (dashed red line). c To reduce potential false positives obtained from 3D interaction data, we implemented a filtering step to remove ambiguous pairs (gray box in b) that link cCREs-ELS to more than one gene. This filtering step was not required for assays that explicitly listed the linked gene (eQTLs and crisprQTLs). Additionally, for comparisons between BENGI datasets, we also curated matching sets of interactions with a fixed positive-to-negative ratio. Therefore, a total of four BENGI datasets were curated for each 3D chromatin experiment (A, B, C, D), and two were curated for each genetic interaction and CRISPR/Cas9 perturbation experiment (A, B). d To avoid overfitting of machine-learning algorithms, all cCRE-gene pairs were assigned to cross-validation (CV) groups based on their chromosomal locations. Positive and negative pairs on the same chromosome were assigned to the same CV group, and chromosomes with complementary sizes were assigned to the same CV group so that the groups contained approximately the same number of pairs.
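The labeling rule for 3D chromatin links in panel b can be illustrated with a small Python sketch. This is not the authors' pipeline: the interval representation, the function names (overlaps, genes_with_tss_in, classify_link) and the toy coordinates are assumptions made for illustration, and the sketch only labels a link by what its far anchor overlaps.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def genes_with_tss_in(anchor: Interval, tss: Dict[str, List[Interval]]) -> Set[str]:
    """Names of genes that have at least one TSS overlapping the anchor."""
    return {gene for gene, sites in tss.items()
            if any(overlaps(anchor, site) for site in sites)}


def classify_link(ccre: Interval, anchor_a: Interval, anchor_b: Interval,
                  tss: Dict[str, List[Interval]]) -> Optional[Tuple[str, Set[str]]]:
    """Label a link following the caption's rule for 3D chromatin interactions.

    Returns None when the cCRE-ELS touches neither anchor. Otherwise the label
    depends on the other anchor: TSSs of exactly one gene -> positive,
    TSSs of several genes -> ambiguous, no TSS -> negative.
    """
    for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
        if overlaps(ccre, near):
            genes = genes_with_tss_in(far, tss)
            if len(genes) == 1:
                return "positive", genes
            if len(genes) > 1:
                return "ambiguous", genes
            return "negative", genes
    return None


# Toy data (made up): three genes and one cCRE-ELS on chr1.
toy_tss = {
    "GENE_A": [("chr1", 1000, 1001)],
    "GENE_B": [("chr1", 1050, 1051)],
    "GENE_C": [("chr1", 5000, 5001)],
}
toy_ccre = ("chr1", 100, 300)
print(classify_link(toy_ccre, ("chr1", 50, 350), ("chr1", 4900, 5100), toy_tss))
```

Dropping every link labeled "ambiguous" would correspond to the filtering step described in panel c.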
Fig. 2
Characteristics of BENGI datasets. Six datasets in GM12878 or other LCLs were evaluated: RNAPII ChIA-PET (red), CTCF ChIA-PET (orange), Hi-C (green), CHi-C (blue), GEUVADIS eQTLs (purple), and GTEx eQTLs (pink), and the same color scheme is used for all panels. a Heatmap depicting the overlap coefficients between positive cCRE-gene pairs in each BENGI dataset. The datasets were clustered using the hclust algorithm, and the clustered datasets are outlined in black. b Violin plots depicting the distance distributions of positive cCRE-gene pairs for each BENGI dataset. The 95th percentile of each distribution is indicated by a star and presented above each plot. c Violin plots depicting the expression levels of genes in positive cCRE-gene pairs (in transcripts per million, TPM). d Violin plots depicting CTCF signal levels at cCREs-ELSs in positive cCRE-gene pairs. A dashed box indicates cCREs-ELS with a signal > 5. e Distributions of the number of genes positively linked with a cCRE-ELS across datasets
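For readers unfamiliar with the overlap coefficient used in panel a, the following Python sketch shows one common definition (the Szymkiewicz-Simpson coefficient: shared items divided by the size of the smaller set). The caption does not spell out the formula, so treating it as this definition is an assumption, and the dataset names and pairs below are invented.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def overlap_coefficient(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """|A intersect B| / min(|A|, |B|); defined here as 0.0 if either set is empty."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def pairwise_overlap(datasets: Dict[str, set]) -> Dict[Tuple[str, str], float]:
    """Overlap coefficient for every unordered pair of named datasets."""
    return {(x, y): overlap_coefficient(datasets[x], datasets[y])
            for x, y in combinations(sorted(datasets), 2)}


# Toy positive cCRE-gene pairs (made up), keyed by a hypothetical dataset name.
toy_sets = {
    "assay_1": {("E1", "G1"), ("E2", "G2"), ("E3", "G3")},
    "assay_2": {("E1", "G1"), ("E2", "G2"), ("E4", "G4"), ("E5", "G5")},
    "assay_3": {("E9", "G9")},
}
matrix = pairwise_overlap(toy_sets)
print(matrix)
```

A matrix like this could then be passed to a hierarchical clustering routine, as the caption does with hclust.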
Fig. 3
Evaluation of unsupervised methods for predicting cCRE-gene pairs. a Precision-recall (PR) curves for four unsupervised methods evaluated on RNAPII ChIA-PET pairs in GM12878: distance between cCREs-ELS and genes (gray), DNase-DNase correlation by Thurman et al. (green), DNase-expression correlation by Sheffield et al. (purple), and the average rank of the distance and the DNase-expression method (black). The areas under the PR curve (AUPRs) for the four methods are listed in the legend. The AUPR for a random method is indicated with a dashed line at 0.15. b The AUPRs for the four unsupervised methods are computed for each of the six benchmark datasets from LCLs. c Genome browser view (chr6:88,382,922-88,515,031) of epigenomic signals and positive BENGI links (RNAPII ChIA-PET in red, Hi-C in green, CHi-C in blue, and GEUVADIS eQTL in pink) connecting the EH37E0853090 cCRE (star) to the AKIRIN2 gene. d Scatter plot of normalized AKIRIN2 expression vs. the normalized DNase signal at EH37E0853090 as calculated by Sheffield et al. (Pearson correlation coefficient = 0.16). Although AKIRIN2 is highly expressed across many tissues, EH37E0853090 presents high DNase signals primarily in lymphoblastoid cell lines (purple triangles), resulting in a low correlation
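As a rough illustration of how the distance baseline and the average-rank method in panels a and b could be scored, the Python sketch below ranks toy cCRE-gene pairs and computes average precision, a step-wise estimate of the AUPR. The function names, the toy numbers and the use of average precision instead of whatever PR-curve integration the authors used are all assumptions. Tied scores are ordered arbitrarily.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive (higher score = better)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive pair is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / n_pos


def ranks(values: Sequence[float], reverse: bool = False) -> List[int]:
    """1-based rank of each value; rank 1 is the smallest value (or largest if reverse)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def average_rank_scores(distances: Sequence[float],
                        correlations: Sequence[float]) -> List[float]:
    """Combine 'closest first' and 'highest correlation first' ranks into one score."""
    d_rank = ranks(distances)                    # rank 1 = shortest distance
    c_rank = ranks(correlations, reverse=True)   # rank 1 = highest correlation
    return [-(d + c) / 2 for d, c in zip(d_rank, c_rank)]


# Toy pairs (made up): 1 = positive, 0 = negative.
labels = [1, 0, 1, 0, 0, 0]
distances = [5_000, 20_000, 60_000, 150_000, 300_000, 800_000]  # bp
correlations = [0.10, 0.05, 0.70, 0.20, -0.10, 0.00]

ap_distance = average_precision(labels, [-d for d in distances])
ap_avg_rank = average_precision(labels, average_rank_scores(distances, correlations))
random_level = sum(labels) / len(labels)  # expected score of a random ranking
print(ap_distance, ap_avg_rank, random_level)
```

A random ranking is expected to score close to the fraction of positive pairs, which is presumably why the random baseline in the figure is drawn as a horizontal dashed line.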
Fig. 4
Evaluation of supervised learning methods for predicting cCRE-gene pairs. a PR curves for three supervised methods evaluated using RNAPII ChIA-PET pairs in GM12878: PEP-motif (green) and two versions of TargetFinder (full model in darker blue and core model in lighter blue). For comparison, two unsupervised methods presented in Fig. 3 (the distance (gray) and average-rank (black) methods) are also shown along with the AUPR for a random method (dashed line at 0.15). The AUPRs for the methods are listed in the legend. b AUPRs for the three supervised methods, two unsupervised methods, and a random approach, colored as in a, for each of the six BENGI datasets from LCLs. c Scatter plot of AUPRs for TargetFinder (triangles) and PEP-motif (circles) across the BENGI datasets evaluated using 12-fold random CV (X-axis) vs. chromosome-based CV (Y-axis). The diagonal dashed line indicates X = Y. d Schematic diagram for the full and core4 TargetFinder models
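The chromosome-based CV compared in panel c keeps all pairs from one chromosome in the same group and combines chromosomes of complementary sizes. A minimal Python sketch of one way to build such groups follows. The exact pairing the authors used is not given here, so the largest-with-smallest pairing, the function name chromosome_cv_groups and the toy counts are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def chromosome_cv_groups(pair_chroms: Sequence[str]) -> Dict[str, int]:
    """Map each chromosome to a CV group, pairing the largest with the smallest.

    Chromosome 'size' is taken as the number of cCRE-gene pairs it holds.
    With an odd number of chromosomes the middle one forms its own group.
    """
    counts = Counter(pair_chroms)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))  # most pairs first
    groups: Dict[str, int] = {}
    lo, hi, group = 0, len(ordered) - 1, 0
    while lo <= hi:
        groups[ordered[lo]] = group
        groups[ordered[hi]] = group
        lo, hi, group = lo + 1, hi - 1, group + 1
    return groups


# Toy input (made up): the chromosome of each cCRE-gene pair, positives and negatives alike.
toy_chroms: List[str] = ["chr1"] * 6 + ["chr2"] * 5 + ["chr3"] * 2 + ["chr4"] * 1
chrom_to_group = chromosome_cv_groups(toy_chroms)
pair_groups = [chrom_to_group[c] for c in toy_chroms]
group_sizes = Counter(pair_groups)
print(chrom_to_group, dict(group_sizes))
```

Because a pair's group depends only on its chromosome, no chromosome contributes pairs to both the training and the test folds, unlike in random CV.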
Fig. 5
Evaluation of supervised learning methods trained in one cell type and tested in another cell type. AUPRs for the distance (gray), average-rank (black), and TargetFinder core4 (purple) methods across a RNAPII ChIA-PET, b CTCF ChIA-PET, c CHi-C, d Hi-C, and e GTEx eQTL pairs. The cell type used for training is indicated in the panel title, and the cell type used for testing is indicated on the X-axis. The best-performing method for each dataset is indicated by a star, and random performance is indicated with a dashed line
