Bioinformatics. 2018 Feb 1;34(3):381-387.
doi: 10.1093/bioinformatics/btx595.

CGmapTools improves the precision of heterozygous SNV calls and supports allele-specific methylation detection and visualization in bisulfite-sequencing data

Weilong Guo et al. Bioinformatics.

Abstract

Motivation: DNA methylation is important for gene silencing and imprinting in both plants and animals. Recent advances in bisulfite sequencing allow detection of single nucleotide variations (SNVs) with high sensitivity, but accurately identifying heterozygous SNVs from partially C-to-T converted sequences remains challenging.

Results: We designed two methods, BayesWC and BinomWC, that substantially improved the precision of heterozygous SNV calls from ∼80% to 99% while retaining comparable recall. With these SNV calls, we provide functions for allele-specific DNA methylation (ASM) analysis and for visualizing the methylation status on reads. Applying ASM analysis to a previous dataset, we found that an average of 1.5% of investigated regions showed allelic methylation; these regions were significantly enriched in transposon elements and were likely to be shared within the same cell type. A dynamic fragment strategy was used for differentially methylated region (DMR) analysis in low-coverage data and, applied to a public cancer dataset, found DMRs related to key genes involved in tumorigenesis. Finally, we integrated 40 applications into the software package CGmapTools to analyze DNA methylomes. The package uses CGmap as its interface format, defines binary formats that reduce file size and support fast data retrieval, and can be applied to context-wise, gene-wise, bin-wise, region-wise and sample-wise analyses and visualizations.

Availability and implementation: The CGmapTools software is freely available at https://cgmaptools.github.io/.

Contact: guoweilong@cau.edu.cn.

Supplementary information: Supplementary data are available at Bioinformatics online.
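
As a rough illustration of the BinomWC idea (the caption of Fig. 1C describes it as adding ambiguous read counts to the corresponding positions, selecting a set of nucleotides per strand with a binomial test, and intersecting the two sets), the following Python sketch may help. It is not the CGmapTools implementation. The function names, the `error_rate` and `alpha` values, and the exact form of the test are assumptions made for illustration. Counts are given in Watson-strand coordinates; T reads on the Watson strand are treated as possibly arising from C, and A reads on the Crick strand as possibly arising from G.

```python
from math import comb


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def strand_candidates(counts, ambiguous, error_rate=0.01, alpha=0.01):
    """Select nucleotides whose support is unlikely to be sequencing error.

    counts: dict mapping "A", "T", "C", "G" to read counts on one strand.
    ambiguous: (observed, hidden) pair; reads showing `observed` may come
        from `hidden` after bisulfite conversion, so they are added to the
        support of `hidden` as well.
    error_rate and alpha are hypothetical values, not taken from the paper.
    """
    n = sum(counts.values())
    if n == 0:
        return set()
    observed, hidden = ambiguous
    support = dict(counts)
    support[hidden] = counts[hidden] + counts[observed]
    return {
        nuc
        for nuc, k in support.items()
        if k > 0 and binom_sf(k, n, error_rate) < alpha
    }


def call_genotype(watson, crick, **kwargs):
    """Intersect the candidate sets of the two strands."""
    w = strand_candidates(watson, ("T", "C"), **kwargs)
    c = strand_candidates(crick, ("A", "G"), **kwargs)
    return w & c


# Watson reads are all T (could be T or converted C); Crick reads resolve it.
watson = {"A": 0, "T": 20, "C": 0, "G": 0}
crick_het = {"A": 0, "T": 10, "C": 10, "G": 0}
print(sorted(call_genotype(watson, crick_het)))
```

The sketch only covers the case where both strands have coverage. When one strand is uncovered the intersection is empty, which is the situation where, per Fig. 1B, a wildcard genotype is needed instead.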


Figures

Fig. 1.
SNV calling from BS-seq data by introducing wild-card genotypes. (A) Definition of an ATCGmap table for one position. Aw# is the read count of the Watson strand supporting the position as A on the Watson strand. Ac# is the read count of the Crick strand supporting the position as A on the Watson strand (T on the Crick strand). (B) Examples of genotype prediction from an ATCGmap table. Taking RRBS as an example, the upper case covers only one strand, and the read counts could come either from genotype T or from genotype C because of bisulfite conversion; introducing a wildcard in the genotype is therefore necessary. The lower case has high coverage on both strands, and information from the reverse strand helps infer an explicit genotype. (C) The schema for the BinomWC strategy when both strands have sufficient coverage. Ambiguous read counts are added to the corresponding positions in the table, and a binomial test is used to select a set of nucleotides from each strand; the intersection of the two sets is then used as the final predicted genotype. (D) The precision analysis for heterozygous SNV calling in simulated WGBS datasets for four strategies. The average coverage levels are 10×, 20×, 30×, 40× and 50×. (E) The precision analysis for heterozygous SNV calling in simulated RRBS datasets for four strategies.
Fig. 2.
Allele-specific DNA methylation in humans and mice. (A) Scatter plot showing the percentage of ASM events and the number of heterozygous SNVs defined in both human and mouse samples. Round dots represent human samples and triangles represent mouse samples. (B) Enrichment analysis of ASM events in genomic elements showing different genomic bias within species. Colours indicate significance levels of enrichment by the hypergeometric test. (C) Heatmap showing the consistency of ASM among cell types in both human (left panel) and mouse (right panel). Colours indicate significance levels of consistency from low to high using the hypergeometric test. (D) Representative ASM locus linked by a known SNP site in dbSNP130, located in the 5’ UTR of FAM160A and in a CpG island. Reads linked by the two heterozygous alleles were representatively selected in the Tanghulu plot for three brain samples and one oocyte sample. (E) Representative ASM locus linked by a heterozygous SNV site with a C-to-T transition disrupting the CpG context of one allele, located in a known imprinted gene, H19. Reads linked by T, identified with a grey rectangular background, were ambiguous reads that could not be assigned to allele C or T due to bisulfite conversion. Open circles, unmethylated CpG sites; filled circles, methylated CpG sites.
Fig. 3.
Differentially methylated region analysis in CGmapTools. (A) Schematic presentation of how dynamic fragments are defined. First, only sites covered by both samples are selected; second, the genome is scanned to define fragments using the minimal cytosine (usually CpG) count n, the maximal fragment size S, and the maximal distance between two adjacent cytosines. Grey circles indicate cytosine sites that were covered by only one sample. Then, a t-test is applied to compare the methylation levels of the cytosines in each fragment between the samples. Solid arrows indicate extension of a fragment, and dotted arrows indicate termination of the extension. (B) Graphical presentation of the DMR and dynamic fragments (background) in a region on chr5. Data were from an eRRBS dataset. (C) Lollipop plot for the DMRs in the promoter region of gene VCAN. The arrow indicates the position in (B). The site-specific methylation levels are represented by the height of the bars. In the figure, two dynamic fragments (grey boxes) are reported as DMRs, which are hyper-methylated in AML.
Fig. 4.
Flowchart for CGmapTools. CGmapTools accepts BAM files from BS-Seeker2 or Bismark, produces ATCGmap and CGmap files, and provides a set of functions derived from the two formats, such as SNV calling, coverage analysis, and methylation analysis. CGmapTools also defines the binary formats ATCGbz and CGbz, supporting rapid retrieval of data from large DNA methylome datasets.
Fig. 5.
Graphs generated by CGmapTools. (A) Pie chart of DNA methylation contributions by different contexts in the sample hBrain_FcM55yNeun_Lis. (B) Bar plots of bulk DNA methylation in different contexts. (C) Distribution plot of DNA methylation in different contexts. (D) Distribution plots of mCG shown in bins across the whole genome for the single sample hOocyte_Rrbs_GuoH. (E) Heatmap of DNA methylation in bins across multiple samples. Average methylation levels of CpG are shown on the right, and a hierarchical clustering tree is built based on Spearman’s correlation coefficients. (F) Distribution plot of CpG methylation levels in fragmented regions across gene bodies. (G) The chromosome-wide MEC (left), density plot of MEC, and cumulative distribution of MEC (right) in an AML sample.
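
The dynamic fragment definition in Fig. 3A (sites covered by both samples, a minimal cytosine count n, a maximal fragment size S, and a maximal distance between adjacent cytosines) can be sketched in Python roughly as follows. This is an illustrative reading of the caption, not the CGmapTools code. The function names, the default thresholds, and the choice to start a new fragment at the site that ended the previous one are assumptions.

```python
def shared_sites(sample_a, sample_b):
    """Return sorted positions covered in both samples.

    Each sample is a dict mapping a cytosine position (on one chromosome)
    to its methylation level.
    """
    return sorted(set(sample_a) & set(sample_b))


def dynamic_fragments(positions, min_sites=5, max_size=1000, max_gap=100):
    """Scan sorted positions and group them into fragments.

    A fragment is extended while the next site is within `max_gap` of the
    previous site and the fragment span stays within `max_size`; otherwise
    the extension terminates. Fragments with fewer than `min_sites` sites
    are discarded. The default thresholds are hypothetical.
    Returns a list of (start, end, site_count) tuples.
    """
    fragments = []
    current = []
    for pos in positions:
        too_far = current and pos - current[-1] > max_gap
        too_long = current and pos - current[0] + 1 > max_size
        if too_far or too_long:
            if len(current) >= min_sites:
                fragments.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        fragments.append((current[0], current[-1], len(current)))
    return fragments


# Toy data: position 30 is covered in only one sample and is skipped.
a = {10: 0.9, 20: 0.8, 30: 0.7, 40: 0.9, 50: 0.8, 500: 0.1, 510: 0.2}
b = {10: 0.1, 20: 0.2, 40: 0.1, 50: 0.3, 500: 0.1, 510: 0.2}
sites = shared_sites(a, b)
print(dynamic_fragments(sites, min_sites=3, max_size=100, max_gap=50))
```

Per the caption, the methylation levels of the two samples would then be compared within each reported fragment using a t-test; that step is omitted here because the source does not specify which variant of the test is used.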
