PLoS Comput Biol. 2016 Apr 21;12(4):e1004873.
doi: 10.1371/journal.pcbi.1004873. eCollection 2016 Apr.

CNVkit: Genome-Wide Copy Number Detection and Visualization From Targeted DNA Sequencing

Free PMC article

Eric Talevich et al. PLoS Comput Biol.

Abstract

Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number at equivalent to 100-kilobase resolution genome-wide from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for integration into existing analysis pipelines. CNVkit is freely available from https://github.com/etal/cnvkit.
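The normalization of read counts to a pooled reference can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not CNVkit's implementation: the function names, the pseudocount of 0.5, the use of a plain per-bin median across normal samples, and the toy read depths are all hypothetical choices made for this example.

```python
import math
from statistics import median

def median_centered_log2(depths, pseudocount=0.5):
    """Log2-transform per-bin read depths and subtract the sample median.

    The pseudocount (an assumption of this sketch) avoids log2(0).
    """
    logs = [math.log2(d + pseudocount) for d in depths]
    center = median(logs)
    return [x - center for x in logs]

def pooled_reference(normal_samples):
    """Per-bin median of the median-centered log2 depths across normal samples."""
    centered = [median_centered_log2(s) for s in normal_samples]
    return [median(column) for column in zip(*centered)]

def log2_ratios(sample_depths, reference):
    """Subtract the pooled reference from the sample's centered log2 depths."""
    centered = median_centered_log2(sample_depths)
    return [s - r for s, r in zip(centered, reference)]

# Toy data (made up): four bins, three normal samples, one test sample
# whose third bin has about twice the depth seen in the normals.
normals = [
    [100, 200, 50, 400],
    [110, 190, 55, 410],
    [90, 210, 45, 390],
]
sample = [100, 200, 100, 400]
ratios = log2_ratios(sample, pooled_reference(normals))
```

In this toy case the third bin comes out near a log2 ratio of +1 (a doubling) while the other bins stay near 0, because systematic per-bin differences in capture efficiency are shared by the sample and the reference and cancel in the subtraction.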

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. CNVkit workflows.
The target and off-target bin BED files and reference file are constructed once for a given platform and can be used to process many samples sequenced on the same platform, as shown in the workflow on the left. Steps to construct the off-target bins are shown at the top-right, and construction of the reference is shown at the lower-right.
Fig 2. Baited region size and spacing affect read depth systematically.
A: Example of typical coverage observed at a targeted exon, as viewed in IGV, and simplified geometric models of the negative coverage biases (yellow) that can occur as a function of the relative sizes of sequence fragments and the baited region. B: Coverage observed at two neighboring targeted exons, and models of the positive coverage biases (red) that can occur where intervals are separated by less than half the insert size of sequence fragments.
Fig 3. Bin read depths are systematically biased by GC content and other factors.
A: GC coverage bias follows a unimodal distribution in sample TR_37_T. Target bins are sorted according to bin GC fraction (x-axis), and the uncorrected, median-centered log2 bin read depths are plotted (y-axis). A rolling median of the bin log2 read depths in order of GC value is drawn in red, showing a systematic deviation from 0 in the selected sample. B: Trendlines summarize each bias type in each sample. TR and EX samples are shown in the top and bottom rows, respectively. Columns show biases due to GC content in target bins and off-target bins, repeat content in off-target bins, and density bias in target bins.
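The rolling-median trendline described in this caption suggests one simple way to remove GC bias: sort bins by GC fraction, estimate the trend with a rolling median, and subtract it from each bin. The Python sketch below assumes exactly that; the function names, the window size, and the toy values are hypothetical, and the sketch is not presented as the correction CNVkit actually applies.

```python
from statistics import median

def rolling_median(values, window):
    """Median over a centered window, truncated at the ends of the list."""
    half = window // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]

def correct_gc_bias(log2_depths, gc_fractions, window=3):
    """Subtract a rolling-median trend, computed in order of GC, from each bin."""
    # Indices of bins sorted by GC fraction.
    order = sorted(range(len(log2_depths)), key=lambda i: gc_fractions[i])
    sorted_depths = [log2_depths[i] for i in order]
    trend = rolling_median(sorted_depths, window)
    # Write each residual back to the bin's original position.
    corrected = [0.0] * len(log2_depths)
    for rank, i in enumerate(order):
        corrected[i] = log2_depths[i] - trend[rank]
    return corrected

# Toy data (made up): a unimodal bias that peaks at GC = 0.5,
# listed in arbitrary bin order.
gc = [0.50, 0.30, 0.70, 0.40, 0.60, 0.35, 0.65, 0.45, 0.55]
depth = [0.0, -0.32, -0.32, -0.08, -0.08, -0.18, -0.18, -0.02, -0.02]
corrected = correct_gc_bias(depth, gc)
```

On real data the window would need to span many bins so that the trend reflects GC content rather than genuine copy number changes in a handful of neighboring bins.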
Fig 4. Bias corrections reduce the extraneous variation in bin read depths.
Distributions of the absolute deviation of on- and off-target bins from the final, segmented copy ratio estimates are shown as box plots at each step of bias correction for all samples in the TR and EX sequencing cohorts. At each step, for on- and off-target bins separately, boxes show the median and interquartile range of absolute deviations and whiskers show the 95% range. Steps shown are the initial median-centered log2 read depth (“Raw”), correction of GC bias (“GC”), correction of on-target density and off-target repeat biases (“Density/Repeat”), and normalization to a pooled reference (“Reference”).
Fig 5. CNVkit copy ratios agree with experimental results from array CGH and FISH on cell line DNA.
A: Whole-genome profiles of log2 copy ratio by CNVkit (top) and array CGH (bottom) are shown. B: Genes additionally assayed by FISH are labeled with the detected absolute copy number. At CDKN2A, log2 ratios below the marked level of -3.58 indicate the site is entirely deleted in the majority of cells.
Fig 6. Comparison of CNVkit and other methods to array CGH.
Log2 ratio estimates by CNVkit, CONTRA and CopywriteR were compared to those by array CGH at each of the targeted genes in the TR and EX cohorts as well as the C0902 cell line sample (CL). The distribution of differences of segmented log2 ratio estimates by each caller from that of array CGH at each targeted gene is shown as a box plot, where each box shows the median and interquartile range of absolute deviations, whiskers show the 95% range, and the magnitude of the 95% range (prediction interval) is printed under the box plot. Columns are CNV callers, and rows are the TR and EX cohorts and C0902 sample on which the callers were evaluated.
Fig 7. Precision and recall of absolute copy number calls.
CNV calls obtained using each sequencing-based method are compared to those determined by array CGH to calculate precision and recall under several criteria for the C0902 cell line sample. Columns show detection of each copy number level versus the neutral hexaploid state. Rows show criteria for comparison: all CNVs, CNVs larger than 5 MB, CNVs smaller than 5 MB, all CNV basepairs. Each subplot shows the calculated precision and recall of CNVkit, CopywriteR and CONTRA with each supported reference.
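Precision and recall against a truth set such as array CGH can be computed as in the following minimal Python sketch. It treats each CNV as a discrete labeled call, which is a simplification: the paper's criteria (CNV size classes, basepair-level comparison) are not modeled here, and the function name and the segment labels are hypothetical.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) of predicted calls against truth calls.

    precision = true positives / all predicted calls
    recall    = true positives / all truth calls
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Toy labels (made up): segments called as gains by a sequencing-based
# method versus segments called as gains by the truth assay.
sequencing_calls = ["seg1", "seg2", "seg5"]
truth_calls = ["seg1", "seg2", "seg3", "seg4"]
precision, recall = precision_recall(sequencing_calls, truth_calls)
```

Here two of the three sequencing calls are confirmed (precision 2/3) and two of the four truth calls are recovered (recall 1/2).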
