Nat Commun. 2017 Sep 14;8(1):535.
doi: 10.1038/s41467-017-00478-8.

Identifying Topologically Associating Domains and Subdomains by Gaussian Mixture Model And Proportion Test

Free PMC article

Wenbao Yu et al. Nat Commun. 2017.

Abstract

The spatial organization of the genome plays a critical role in regulating gene expression. Recent chromatin interaction mapping studies have revealed that topologically associating domains and subdomains are fundamental building blocks of the three-dimensional genome. Identifying such hierarchical structures is a critical step toward understanding the three-dimensional structure-function relationship of the genome. Existing computational algorithms lack statistical assessment of domain predictions and are computationally inefficient for high-resolution Hi-C data. We introduce the Gaussian Mixture model And Proportion test (GMAP) algorithm to address these challenges. Using simulated and experimental Hi-C data, we show that domains identified by GMAP are more consistent with multiple lines of supporting evidence than those identified by three state-of-the-art methods. Application of GMAP to normal and cancer cells reveals several unique features of subdomain boundaries compared with domain boundaries, including their higher dynamics across cell types and enrichment for somatic mutations in cancer.

Spatial organization of the genome plays a crucial role in regulating gene expression. Here the authors introduce GMAP, the Gaussian Mixture model And Proportion test, to identify topologically associating domains and subdomains in Hi-C data.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Fig. 1
Overview of the GMAP method. The method consists of three major steps. In step one, we fit a Gaussian mixture model with two components representing chromatin interactions within and outside of a domain. In step two, for each genomic bin, we determine whether it is located at the boundary of a block of dense chromatin interactions by performing a proportion test on the observed contact counts within and between the windows flanking the bin. In step three, we call chromatin domains based on the location and orientation of the candidate boundaries identified in step two.
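Step two of the caption above can be illustrated with a generic two-proportion z-test. This is a minimal sketch, not the authors' implementation: the function names, the fixed `threshold` (a stand-in for the cutoff that the two-component mixture model of step one would supply), the window layout, and the toy matrix are all assumptions made for illustration.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """One-sided two-proportion z-test: is proportion A larger than proportion B?"""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both proportions are identical (all 0 or all 1): no evidence of a difference.
        return 0.0, 0.5
    z = (hits_a / n_a - hits_b / n_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


def boundary_test(contacts, i, window, threshold):
    """Compare dense contacts inside the downstream window of bin i with
    dense contacts between the upstream and downstream windows.

    Assumes window >= 2 and window <= i <= len(contacts) - window.
    `threshold` is a hypothetical cutoff separating dense from sparse contacts.
    """
    down = range(i, i + window)
    up = range(i - window, i)
    within = [contacts[a][b] > threshold for a in down for b in down if a < b]
    between = [contacts[a][b] > threshold for a in up for b in down]
    return two_proportion_z(sum(within), len(within), sum(between), len(between))


# Toy contact matrix (assumption): two domains, bins 0-3 and bins 4-7.
n = 8
contacts = [[10 if (a < 4) == (b < 4) else 1 for b in range(n)] for a in range(n)]

z_edge, p_edge = boundary_test(contacts, 4, 3, threshold=5)      # bin at the domain edge
z_inner, p_inner = boundary_test(contacts, 2, 2, threshold=5)    # bin inside a domain
```

In this toy case the bin at the domain edge gives a small p-value, while the bin inside a domain gives no evidence of a boundary. A real pipeline would also need a multiple-testing correction across bins, which is omitted here.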
Fig. 2
Performance comparison using simulated data. Hi-C contact count matrices were simulated using a Poisson distribution. a Overall similarity between predicted and true domains measured using the Variation of Information (VI) index. b Overall similarity between predicted and true domains measured using the Jaccard index. Shown are boxplots of the VI and Jaccard indices over 100 simulations. The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box. P-values for comparing the performance metrics (VI or Jaccard index) of different methods are based on paired t-tests. c An example of TADs called by different methods using simulated Hi-C data without embedded sub-TADs. Called domains are outlined by solid black lines. d An example of TADs called by different methods using simulated Hi-C data with embedded TADs and sub-TADs.
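The two similarity measures named in this caption can be sketched as follows. The VI formula is the standard definition for two partitions of the same bins. Computing the Jaccard index on sets of boundary positions is an assumption made here for illustration; the caption does not say which sets the paper compares. All function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """VI between two partitions, given as one domain label per genomic bin.

    VI = -sum over (a, b) of p_ab * [log(p_ab / p_a) + log(p_ab / p_b)];
    it is 0 for identical partitions and grows as they diverge.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n)) + math.log(p_ab / (count_b[b] / n)))
    return vi


def boundaries(labels):
    """Bin indices at which the domain label changes."""
    return {i for i in range(1, len(labels)) if labels[i] != labels[i - 1]}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy example (assumption): the prediction splits the first true domain in two.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 2, 2, 2, 2]

vi = variation_of_information(true_labels, pred_labels)
jac = jaccard(boundaries(true_labels), boundaries(pred_labels))
```

Lower VI and higher Jaccard both indicate closer agreement, so the two panels point in opposite directions for "better".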
Fig. 3
Performance comparison using experimental Hi-C data. a, b Similarity between TADs called using low-resolution (10 kb) and high-resolution (10 kb) Hi-C data. Hi-C data for the human lung fibroblast cell line, IMR90, were obtained from refs , . Similarity was measured using Variation of Information (left) and Jaccard index (right). c Similarity between subTADs called using Hi-C and 5C data. No data are shown for HiCseg and metaTAD since they do not call subTADs. Average number of CTCF peaks d, Rad21 peaks e, Pol2 peaks f, and H3K4me3 peaks g per TAD boundary. Values represent the average number of peaks within a TAD boundary plus 25 kb flanking regions on either side of the boundary, across all chromosomes and six cell types (IMR90, GM12878, NHEK, HMEC, HUVEC, and K562). P-values are based on paired t-tests. The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box. h Running speed of different methods.
Fig. 4
SubTAD boundaries are more dynamic than TAD boundaries across different cell lines. a Pairwise similarity of TADs across six human cell lines: GM12878, HMEC, HUVEC, IMR90, K562, and NHEK. In all but four pairwise comparisons, the difference between GMAP and the other three methods is significant based on a t-test (P < 0.05). The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box. b Comparison of pairwise similarities for TADs and subTADs across six human cell lines. The cumulative probability plots show that subTAD boundaries are more dynamic across all pairwise comparisons of the six cell lines. In all pairwise comparisons, the cumulative distribution for subTADs is significantly different from the distribution for TADs based on a KS test (P < 0.05).
Fig. 5
Relationship between hierarchical domain boundary, somatic mutation, and enhancer–promoter communication. a, b SubTAD but not TAD boundaries are enriched for somatic mutations in cancer. Percentage of TAD and subTAD boundaries overlapping with at least one recurrent mutation for MCF10A cells a and PrEC cells b. TADs and subTADs were identified using Hi-C data for a non-tumorigenic mammary gland epithelial cell line (MCF10A) and a prostate epithelial cell line (PrEC). Solid line, TADs; dashed line, subTADs. Observed percentages are indicated by vertical lines with an arrow. Distributions of expected percentages are generated using 10,000 sets of randomly selected genomic regions with the same number and size as the called TADs/subTADs for each cancer cell type. c SubTAD boundaries are more dynamic than TAD boundaries between cancer and normal cells. MCF7, breast adenocarcinoma cell line; LNCaP, prostate carcinoma cell line; PC3, prostate adenocarcinoma cell line. d Schematic demonstrating that cell-type-specific enhancer–promoter (EP) interactions are blocked by a newly formed domain boundary in the other cell type. e Proportion of cell-type-specific domain boundaries that overlap with cell-type-specific EP interactions in the other cell type. P-value is based on a t-test. f Expression levels of promoters involved in cell-specific EP interactions that are blocked by a TAD boundary in the other cell type. g Expression levels of promoters involved in cell-specific EP interactions that are blocked by a subTAD boundary in the other cell type. P-values are based on t-tests. The whiskers extend to the most extreme data point that is no more than 1.5 times the interquartile range from the box.
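The expected-percentage distributions in panels a and b come from random region sets that match the observed regions in number and size. A minimal sketch of that kind of empirical enrichment test is shown below. It assumes a single linear coordinate system, half-open intervals, uniform random placement, and an add-one empirical p-value; none of these details are stated in the caption, and the function names and toy data are hypothetical.

```python
import bisect
import random


def fraction_overlapping(regions, sorted_positions):
    """Fraction of half-open regions [start, end) that contain at least one position."""
    hits = 0
    for start, end in regions:
        k = bisect.bisect_left(sorted_positions, start)
        if k < len(sorted_positions) and sorted_positions[k] < end:
            hits += 1
    return hits / len(regions)


def enrichment_p_value(regions, positions, genome_length, n_random=10000, seed=0):
    """Empirical one-sided p-value for the observed overlap fraction.

    Each random set keeps the number and sizes of the observed regions
    and places them uniformly at random on [0, genome_length).
    """
    rng = random.Random(seed)
    positions = sorted(positions)
    observed = fraction_overlapping(regions, positions)
    sizes = [end - start for start, end in regions]
    at_least = 0
    for _ in range(n_random):
        shuffled = []
        for size in sizes:
            start = rng.randrange(0, genome_length - size + 1)
            shuffled.append((start, start + size))
        if fraction_overlapping(shuffled, positions) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (at_least + 1) / (n_random + 1)


# Toy data (assumption): three boundary regions, each covering one mutation.
toy_regions = [(950, 1050), (1950, 2050), (2950, 3050)]
toy_mutations = [1000, 2000, 3000]
observed, p_value = enrichment_p_value(toy_regions, toy_mutations, 1_000_000, n_random=1000)
```

A real analysis would sample per chromosome and would likely exclude unmappable regions; both are left out of this sketch.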

References

    1. Dixon JR, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–380. doi: 10.1038/nature11082.
    2. Shen Y, et al. A map of the cis-regulatory sequences in the mouse genome. Nature. 2012;488:116–120. doi: 10.1038/nature11243.
    3. Nora EP, et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature. 2012;485:381–385. doi: 10.1038/nature11049.
    4. Rao SS, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021.
    5. Filippova D, Patro R, Duggal G, Kingsford C. Identification of alternative topological domains in chromatin. Algorithms Mol. Biol. 2014;9:14. doi: 10.1186/1748-7188-9-14.
