Identification of factors associated with duplicate rate in ChIP-seq data

Shulan Tian; Shuxia Peng; Michael Kalmbach; Krutika S Gaonkar; Aditya Bhagwate; Wei Ding; Jeanette Eckel-Passow; Huihuang Yan; Susan L Slager

doi:10.1371/journal.pone.0214723

PLoS One. 2019 Apr 3;14(4):e0214723. doi: 10.1371/journal.pone.0214723. eCollection 2019.

Authors

Shulan Tian¹, Shuxia Peng¹, Michael Kalmbach², Krutika S Gaonkar¹, Aditya Bhagwate¹, Wei Ding³, Jeanette Eckel-Passow¹, Huihuang Yan¹, Susan L Slager¹

Affiliations

¹ Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America.
² Division of Research and Education Support Systems, Department of Information Technology, Mayo Clinic, Rochester, Minnesota, United States of America.
³ Division of Hematology, Mayo Clinic, Rochester, Minnesota, United States of America.

Abstract

Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contain redundant reads termed duplicates, that is, reads mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates, which represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. During analysis, duplicates are removed before peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals, so removing all duplicates will underestimate the signal level in peaks and affect the identification of signal changes across samples. An in-depth evaluation of the impact of duplicate removal is therefore needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we examined the distribution of duplicates in the genome, the extent to which duplicate removal affects peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq datasets had about 40% duplicates, 97% of which were within peaks. For the other datasets, generated with PCR amplification of ChIP DNA, the narrow-peak marks had, as expected, a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, most conspicuously in high-confidence peaks. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides a basis for properly allocating duplicates between noise and signal. Our analysis supports the feasibility of retaining the signal portion of duplicates in downstream analysis, thus alleviating the limitation of complete deduplication.
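
The duplicate definition used in the abstract (reads mapping to the same genomic location and strand) and the "share of duplicates within peaks" statistic can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the Read type, the function names, the use of a single mapping coordinate per read, and the half-open peak intervals are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical minimal representation of an aligned single-end read."""
    chrom: str
    start: int   # mapping coordinate (assumed: one coordinate per read)
    strand: str  # "+" or "-"


def extra_copies(reads: Iterable[Read]) -> List[Read]:
    """Return every read beyond the first seen at a (chrom, start, strand)."""
    seen = set()
    dups = []
    for read in reads:
        if read in seen:
            dups.append(read)  # same location and strand already observed
        else:
            seen.add(read)
    return dups


def duplicate_rate(reads: Iterable[Read]) -> float:
    """Fraction of all reads that are extra copies of an earlier read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return len(extra_copies(reads)) / len(reads)


def fraction_of_duplicates_in_peaks(
    reads: Iterable[Read],
    peaks: Dict[str, List[Tuple[int, int]]],
) -> float:
    """Fraction of duplicate reads whose coordinate falls inside a peak.

    peaks maps chromosome name to half-open (start, end) intervals
    (an assumed convention, as in BED files).
    """
    dups = extra_copies(reads)
    if not dups:
        return 0.0
    in_peak = sum(
        1
        for read in dups
        if any(s <= read.start < e for s, e in peaks.get(read.chrom, []))
    )
    return in_peak / len(dups)
```

Note that a read on the opposite strand at the same coordinate is not counted as a duplicate here, matching the definition above. The sketch cannot distinguish PCR duplicates from natural duplicates; it only counts redundant reads, which is the quantity the abstract reports.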

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Cell Line, Tumor
Chromatin Immunoprecipitation Sequencing / methods*
Data Analysis
Datasets as Topic
HeLa Cells
Humans
MCF-7 Cells
Polymerase Chain Reaction
Reproducibility of Results
