Genome Res. 2013 Jan;23(1):129-41.
doi: 10.1101/gr.136739.111. Epub 2012 Oct 23.

Modeling kinetic rate variation in third generation DNA sequencing data to detect putative modifications to DNA bases

Eric E Schadt et al. Genome Res. 2013 Jan.

Abstract

Current generation DNA sequencing instruments are moving closer to seamlessly sequencing genomes of entire populations as a routine part of scientific investigation. However, while significant inroads have been made identifying small nucleotide variation and structural variations in DNA that impact phenotypes of interest, progress has not been as dramatic regarding epigenetic changes and base-level damage to DNA, largely due to technological limitations in assaying all known and unknown types of modifications at genome scale. Recently, single-molecule real time (SMRT) sequencing has been reported to identify kinetic variation (KV) events that have been demonstrated to reflect epigenetic changes of every known type, providing a path forward for detecting base modifications as a routine part of sequencing. However, to date no statistical framework has been proposed to enhance the power to detect these events while also controlling for false-positive events. By modeling enzyme kinetics in the neighborhood of an arbitrary location in a genomic region of interest as a conditional random field, we provide a statistical framework for incorporating kinetic information at a test position of interest as well as at neighboring sites that help enhance the power to detect KV events. The performance of this and related models is explored, with the best-performing model applied to plasmid DNA isolated from Escherichia coli and mitochondrial DNA isolated from human brain tissue. We highlight widespread kinetic variation events, some of which strongly associate with known modification events, while others represent putative chemically modified sites of unknown types.

Figures

Figure 1.
Reproducible variation in interpulse durations as a surrogate for chemical modifications to nucleotide bases. (A) Sample traces of six DNA molecules, three in which the DNA template contains a single 8-oxoG modification (right three traces), and the other three identical but with no 8-oxoG modification (left three traces). While the interpulse duration (IPD) is observed to vary significantly even within the same modification state (a consequence of the exponential distribution of IPDs), the IPDs for the 8-oxoG residue are generally longer than the IPDs for the unmodified G residue. (B) After examining hundreds of molecules in which the G residue was modified versus unmodified, the consistent lengthening of the mean IPD for the modified G residue compared with the mean IPD for the unmodified G residue becomes statistically significant (red bar). The 8-oxoG modification is also seen to affect the IPDs of the neighboring bases in a statistically significant way. The P-value indicated at each position was computed using the Mann-Whitney test.
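The comparison described in panel B can be illustrated with a minimal sketch, assuming exponentially distributed IPDs. The mean IPDs (0.5 s unmodified, 1.5 s modified), the sample size, and the function name `mann_whitney_u` are illustrative assumptions, not values or code from the paper. The test uses the normal approximation and omits the tie correction to the variance, which is reasonable only for continuous data such as these simulated IPDs.

```python
import math
import random


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for sample x, approximate two-sided P-value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at index i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Ranks i+1 .. j share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pooled[k][1] == 0:
                rank_sum_x += avg_rank
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return u1, p


# Simulated IPDs in seconds; the means below are assumed for illustration only.
rng = random.Random(0)
unmodified = [rng.expovariate(1 / 0.5) for _ in range(300)]
modified = [rng.expovariate(1 / 1.5) for _ in range(300)]

u_stat, p_value = mann_whitney_u(modified, unmodified)
print(f"U = {u_stat:.0f}, P = {p_value:.3g}")
```

Individual exponential draws overlap heavily between the two groups, so a single molecule is rarely decisive; the rank-based test becomes significant only once many molecules are pooled, which matches the point made in the caption.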
Figure 2.
DNA polymerase kinetics in SMRT sequencing is a function of the local sequence context of the incorporation site, motivating a conditional random field (CRF) approach to detecting kinetic variation events (KVEs). (A) Heatmap of the coefficient of determination (R²) for the IPD variance at the incorporation site of a SMRT sequencing reaction explained by local sequence context. This heatmap suggests that seven bases upstream of and two bases downstream from the incorporation site are the most informative, and that bases beyond this context do not provide much additional information about the enzyme kinetics. (B) Scatter plot comparing IPDs in identical sequence contexts between whole-genome amplified E. coli and M. genitalium samples. Each point represents the log of the IPD for a given 10-bp context (seven bases upstream of and two bases downstream from the incorporation site) in E. coli (y-axis) and M. genitalium (x-axis): 2500 points sampled from the 1,048,576 possible 10-mer contexts are shown here for ease of viewing. The strong correlation (Pearson's correlation coefficient = 0.91) between IPDs in identical contexts assayed from completely independent sequencing runs of different species demonstrates that the context effects are highly consistent between experiments. (C) Graphical representation of the CRF model. The hidden variables represent the modification state of site i, while the observed variables are the IPD values for site i that inform on the modification status of the site. The model considers interactions between the incorporation site and the two nearest neighboring sites on each side of it. The edges between variables indicate that local sites can interact, with the edge parameters representing the degree of interaction among the nodes. The rate parameters represent the exponential rates for the two possible rate classes at each position i, while the proportion parameters represent the proportion of molecules in state k at position i.
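The two rate classes and their proportions in panel C suggest, at a single site, a two-component mixture of exponentials. The sketch below is a simplified single-site illustration under that assumption and is not the paper's CRF, which also models interactions between neighboring sites. The function names, the IPD values, the rates, and the proportions are all hypothetical.

```python
import math


def single_rate_loglik(ipds, rate):
    """Log-likelihood of IPDs under one exponential distribution."""
    return sum(math.log(rate) - rate * x for x in ipds)


def mixture_loglik(ipds, rates, props):
    """Log-likelihood of IPDs under a mixture of exponentials.

    rates: exponential rate for each rate class
    props: proportion of molecules in each class (must sum to 1)
    """
    if abs(sum(props) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    total = 0.0
    for x in ipds:
        density = sum(p * r * math.exp(-r * x) for r, p in zip(rates, props))
        total += math.log(density)
    return total


# Hypothetical IPDs (seconds) at one site: mostly short, a few long.
ipds = [0.2, 0.3, 0.1, 0.4, 2.5, 3.1, 4.0, 0.2]

# Best single exponential: the maximum-likelihood rate is 1 / mean.
mle_rate = len(ipds) / sum(ipds)
one_class = single_rate_loglik(ipds, mle_rate)

# Two rate classes with assumed (not fitted) rates and proportions.
two_class = mixture_loglik(ipds, rates=(4.0, 0.3), props=(0.6, 0.4))

print(f"one class:   {one_class:.2f}")
print(f"two classes: {two_class:.2f}")
```

When a site is a mix of modified and unmodified molecules, a two-class model of this kind can fit the IPDs better than any single rate, and that gain in likelihood is the sort of evidence a likelihood-based test can use.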
Figure 3.
Detecting kinetic variation events using different models derived from the full CRF model. (A) Plasmid pRRS depicted as a circos plot, with the inside of the annulus representing the coordinates of the plasmid, the blue hash marks indicating C residues in a GATC context, and the two red curves representing –log10(P-value) for the single-site likelihood model for the two DNA strands. The P-values are based on 475-fold filtered coverage of the plasmid genome. At a 5% FDR threshold, all methylated sites in the GATC context were detected and no sites outside of the GATC context were detected. (B) Receiver operating characteristic (ROC) curves for the supervised models described in the text applied to the M.Sau3AI plasmid and control data, with false-positive rate (FPR) plotted along the x-axis, and true-positive rate (TPR) plotted along the y-axis. The [−1,+1], [−2,+1], [−3,+1], and [−4,+1] labels in the legend indicate the window size and position with respect to the test site (at 0 in each interval) to which the multisite model was fitted. (C) ROC curves for the unsupervised models described in the text applied to the M.Sau3AI plasmid data only. The ROC curves dip below the diagonal because the number of true-positive sites is small relative to all sites tested and because these sites were detected at a lower rate than false-positive sites. (D) ROC curves for the unsupervised models applied to the 8-oxoG data.
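For readers unfamiliar with how the ROC curves in panels B to D are built, here is a minimal sketch. It assumes each site has a score (for example –log10 of its P-value) and a known label marking whether it is truly modified. The scores and labels are made up for illustration, and `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low score.

    labels: 1 for a truly modified site, 0 otherwise.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        # Sites with the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up per-site scores and truth labels.
scores = [9.1, 7.4, 6.0, 3.2, 2.5, 1.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(f"AUC = {auc(points):.3f}")
```

A curve that falls below the diagonal, as in panel C, corresponds to stretches where lowering the threshold admits false positives faster than true positives. Flipping the labels in this toy example produces such a curve.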
Figure 4.
Kinetic variation events detected in the mitochondrial genome. (A) The bottom left circular plot is an annotation of the mitochondrial genome with respect to the genes found on the heavy and light strands. The larger circular plot indicates the –log10 P-values for each position tested on the mitochondrial genome, with inside and outside of the circle representing the heavy and light strands, respectively. (B) Putative 8-oxoG event detected at position 4186 in the mitochondrial genome (heavy strand). (Left) The IPD values are shown for six molecules from the neuronal mtDNA sample, with each molecule read five to 10 times within its individual SMRTbell. The color coding reflects the IPD value, with dark blue indicating IPDs <0.5 sec and dark red indicating IPDs >3.0 sec. Molecules 1–3 show highly variable IPDs that include long IPDs, as expected if the IPDs are exponentially distributed with a high mean. Molecules 4–6 have significantly lower IPD values compared with molecules 1–3. These data suggest that molecules 1–3 are modified at this position compared with molecules 4–6. (Right) The mean IPD computed for each position within each molecule. The mean values at the highlighted test position are clearly different between molecules 1–3 and molecules 4–6, indicating why this site was detected as a kinetic variation event. None of the sites within 10 bases of this test site were detected as kinetic variation events. (C) DNA samples with (right) and without (left) evidence of modification were treated with a glycosylase to create single-strand breaks at oxidatively modified positions. The samples were PCR amplified before and after treatment to demonstrate the degree of modification. First-derivative plots of the amplification are shown.
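The per-molecule summary in panel B (right) amounts to averaging the IPD at each position over the repeated passes of one molecule. A minimal sketch follows, assuming each molecule's passes are already aligned to the same positions. The molecule names, the IPD values, and the function `per_molecule_mean_ipd` are hypothetical.

```python
from statistics import mean


def per_molecule_mean_ipd(passes_by_molecule):
    """Map each molecule to its mean IPD per position, averaged over passes.

    passes_by_molecule: {molecule_id: [[ipd at pos 0, pos 1, ...] per pass]}
    """
    result = {}
    for molecule, passes in passes_by_molecule.items():
        # zip(*passes) groups the IPDs observed at the same position.
        result[molecule] = [mean(column) for column in zip(*passes)]
    return result


# Hypothetical IPDs (seconds) for three positions; the test site is index 1.
passes_by_molecule = {
    "molecule_1": [[0.4, 3.5, 0.6], [0.5, 0.9, 0.4], [0.3, 4.6, 0.5]],
    "molecule_4": [[0.3, 0.4, 0.5], [0.5, 0.6, 0.3]],
}

means = per_molecule_mean_ipd(passes_by_molecule)
for molecule, values in means.items():
    print(molecule, [round(v, 2) for v in values])
```

Averaging over passes damps the pass-to-pass scatter of exponentially distributed IPDs, so a molecule with a consistently slow test position stands out from one without it.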
