Comparative Study
Nat Commun. 2015 Dec 9;6:10001.
doi: 10.1038/ncomms10001.

A Comprehensive Assessment of Somatic Mutation Detection in Cancer Using Whole-Genome Sequencing

Tyler S Alioto  1   2 Ivo Buchhalter  3   4 Sophia Derdak  1   2 Barbara Hutter  4 Matthew D Eldridge  5 Eivind Hovig  6   7 Lawrence E Heisler  8 Timothy A Beck  8 Jared T Simpson  8 Laurie Tonon  9 Anne-Sophie Sertier  9 Ann-Marie Patch  10   11 Natalie Jäger  3   12 Philip Ginsbach  3 Ruben Drews  3 Nagarajan Paramasivam  3 Rolf Kabbe  3 Sasithorn Chotewutmontri  13 Nicolle Diessl  13 Christopher Previti  13 Sabine Schmidt  13 Benedikt Brors  4 Lars Feuerbach  4 Michael Heinold  4 Susanne Gröbner  14 Andrey Korshunov  15 Patrick S Tarpey  16 Adam P Butler  16 Jonathan Hinton  16 David Jones  16 Andrew Menzies  16 Keiran Raine  16 Rebecca Shepherd  16 Lucy Stebbings  16 Jon W Teague  16 Paolo Ribeca  1   2 Francesc Castro Giner  1   2 Sergi Beltran  1   2 Emanuele Raineri  1   2 Marc Dabad  1   2 Simon C Heath  1   2 Marta Gut  1   2 Robert E Denroche  8 Nicholas J Harding  8 Takafumi N Yamaguchi  8 Akihiro Fujimoto  17 Hidewaki Nakagawa  17 Víctor Quesada  18 Rafael Valdés-Mas  18 Sigve Nakken  6 Daniel Vodák  6   19 Lawrence Bower  5 Andrew G Lynch  5 Charlotte L Anderson  5   20 Nicola Waddell  10   11 John V Pearson  10   11 Sean M Grimmond  10   21 Myron Peto  22 Paul Spellman  22 Minghui He  23 Cyriac Kandoth  24 Semin Lee  25 John Zhang  25   26 Louis Létourneau  27 Singer Ma  28 Sahil Seth  26 David Torrents  29 Liu Xi  30 David A Wheeler  30 Carlos López-Otín  18 Elías Campo  31 Peter J Campbell  16 Paul C Boutros  9   32 Xose S Puente  18 Daniela S Gerhard  33 Stefan M Pfister  14   34 John D McPherson  8   32 Thomas J Hudson  8   32   35 Matthias Schlesner  3 Peter Lichter  36   37 Roland Eils  3   37   38   39 David T W Jones  34 Ivo G Gut  1   2
Free PMC article

Tyler S Alioto et al. Nat Commun. 2015.

Abstract

As whole-genome sequencing for cancer genome analysis becomes a clinical tool, a full understanding of the variables affecting sequencing analysis output is required. Here using tumour-normal sample pairs from two different types of cancer, chronic lymphocytic leukaemia and medulloblastoma, we conduct a benchmarking exercise within the context of the International Cancer Genome Consortium. We compare sequencing methods, analysis pipelines and validation methods. We show that using PCR-free methods and increasing sequencing depth to ∼ 100 × shows benefits, as long as the tumour:control coverage ratio remains balanced. We observe widely varying mutation call rates and low concordance among analysis pipelines, reflecting the artefact-prone nature of the raw data and lack of standards for dealing with the artefacts. However, we show that, using the benchmark mutation set we have created, many issues are in fact easy to remedy and have an immediate positive impact on mutation detection accuracy.

Figures

Figure 1. Differences between the different sample libraries.
Libraries A, E and G are PCR-free. (a) GC bias of the different libraries. The genome was segmented into 10-kb windows. For each window, the GC content was calculated and the coverage for the respective library was added. For better comparability, the coverage was normalized by dividing by the mean. The major band in normal corresponds to autosomes, while the lower band corresponds to sex chromosomes. The increased number of bands in the tumour is because of a higher number of ploidy states in the (largely) tetraploid tumour sample. (b) Cumulative coverage displayed for different libraries. Displayed are all libraries sequenced to at least 28 ×. To make the values comparable, we downsampled all samples to a coverage of 28 × (the lowest coverage of the initially sequenced libraries). The plot shows the percentage of the genome (y axis) covered with a given minimum coverage (x axis). (c) Percentage of certain regions of interest covered with less than 10 ×. Different colours are used to distinguish centres.
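The per-window quantities described in panels (a) and (b) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the input layout (one sequence string and one mean coverage value per window, one depth value per position) are assumptions made for this example.

```python
from statistics import mean


def gc_fraction(window_seq):
    """GC fraction of one window, ignoring ambiguous bases such as N."""
    seq = window_seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # window contains no unambiguous bases
    return (seq.count("G") + seq.count("C")) / acgt


def normalized_coverage(window_coverage):
    """Divide each window's coverage by the mean over all windows."""
    m = mean(window_coverage)
    return [c / m for c in window_coverage]


def cumulative_coverage(depths, max_depth):
    """Percentage of positions covered at a depth of at least d, for d = 0..max_depth."""
    n = len(depths)
    return [
        100.0 * sum(1 for x in depths if x >= d) / n
        for d in range(max_depth + 1)
    ]
```

Plotting `normalized_coverage` against `gc_fraction` per window would give a panel (a) style view, and `cumulative_coverage` gives the curve of panel (b) for one library.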
Figure 2. Effect of sequencing coverage on the ability to call SSMs.
(a) Overlap of SSMs called on different balanced coverages. (b) Density plots of the variant allele frequencies for different balanced coverages of tumour and control (tumour_versus_control) and number of SSMs called in total (calls were performed using the DKFZ calling pipeline, MB.I). (c) Plot of the number of SSMs (y axis) found for a given coverage (x axis). The different colours represent different levels of normal ‘contamination’ in the tumour (0% black, 17% blue, 33% green and 50% orange). Solid lines represent the real data and dashed lines are simulated. Lines are fitted against the Michaelis–Menten model using the ‘drc’ package in R. Solid lines are fitted to the data points and dashed lines are simulated using a mixed inhibition model for enzyme kinetics.
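The saturation curve in panel (c) follows the Michaelis–Menten form, y = Vmax · x / (K + x), with x the coverage and y the number of SSMs. The caption states that the fit was done with the ‘drc’ package in R; the sketch below is a different, simplified stand-in written in plain Python. It scans a user-supplied grid of K values and solves for Vmax in closed form at each one. The function names and the grid are assumptions for illustration only.

```python
def michaelis_menten(x, vmax, k):
    """Michaelis-Menten curve: approaches vmax as x grows; k is the half-saturation point."""
    return vmax * x / (k + x)


def fit_michaelis_menten(xs, ys, k_grid):
    """Least-squares fit by grid search over k.

    For a fixed k the model is linear in vmax, so the best vmax is
    sum(y * g) / sum(g * g) with g = x / (k + x).
    Returns the (vmax, k) pair with the smallest squared error.
    """
    best = None
    for k in k_grid:
        g = [x / (k + x) for x in xs]
        vmax = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
        sse = sum((y - vmax * gi) ** 2 for y, gi in zip(ys, g))
        if best is None or sse < best[2]:
            best = (vmax, k, sse)
    return best[0], best[1]
```

With real data, xs would hold the coverage levels and ys the SSM counts at each level; the fitted Vmax then estimates the plateau the call count approaches as coverage increases.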
Figure 3. Overlap of somatic mutation calls for each level of concordance.
Shared sets of calls are vertically aligned. GOLD indicates the Gold Set. (a) Medulloblastoma SSM calls shared by at least two call sets. (b) Medulloblastoma SIM calls shared by at least two call sets.
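Counting how many call sets share each mutation, as the panels above require, can be sketched like this. The mutation key (for example a tuple of chromosome, position, reference and alternate allele) and the function names are assumptions for this example, not the representation used in the study.

```python
from collections import Counter


def concordance_levels(call_sets):
    """Map each mutation to the number of call sets that contain it.

    call_sets: dict of submission name -> iterable of hashable mutation keys.
    """
    counts = Counter()
    for calls in call_sets.values():
        counts.update(set(calls))  # set() so a duplicate within one submission counts once
    return counts


def shared_by_at_least(call_sets, n=2):
    """Mutations found in at least n call sets."""
    counts = concordance_levels(call_sets)
    return {mutation for mutation, c in counts.items() if c >= n}
```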
Figure 4. Somatic mutation calling accuracy against Gold Sets.
Decreasing sensitivity on Tiers 1, 2 and 3 shown as series for each SSM call set, while precision remains the same. (a) Medulloblastoma SSMs. (b) Medulloblastoma SIMs.
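Precision and sensitivity against a gold set reduce to set arithmetic on the calls. The sketch below assumes each mutation is a hashable key and that the gold set passed in is whichever tier is being evaluated; how the tiers themselves are defined is not specified here, and the function name is hypothetical.

```python
def precision_sensitivity(calls, gold):
    """Return (precision, sensitivity) of a call set against a gold set.

    precision   = TP / (TP + FP)
    sensitivity = TP / (TP + FN)
    """
    calls, gold = set(calls), set(gold)
    tp = len(calls & gold)   # called and in the gold set
    fp = len(calls - gold)   # called but not in the gold set
    fn = len(gold - calls)   # in the gold set but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity
```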
Figure 5. Rainfall plot showing distribution of called mutations on the genome.
The distance between mutations is plotted on a log scale (y axis) versus the genomic position (x axis). TPs (blue), FPs (green) and FNs (red). Four MB submissions representative of distinct patterns are shown. (a) MB.Q is one of the best balanced between FPs and FNs, with low positional bias. (b) MB.L1 has many FNs. (c) MB.C has clusters of FPs near centromeres and FNs on the X chromosome. (d) MB.K has a high FP rate with short-distance clustering of mutations.
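The y values of a rainfall plot are the distances between neighbouring mutations on the same chromosome, on a log scale. A minimal sketch, assuming mutations are given as (chromosome, position) pairs and using a hypothetical function name:

```python
import math
from collections import defaultdict


def intermutation_distances(mutations):
    """Return (chrom, pos, log10 distance to the previous mutation) tuples.

    The first mutation on each chromosome has no predecessor and is skipped,
    as are exact positional duplicates (distance 0 has no logarithm).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in mutations:
        by_chrom[chrom].append(pos)

    points = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        for prev, cur in zip(positions, positions[1:]):
            if cur > prev:
                points.append((chrom, cur, math.log10(cur - prev)))
    return points
```

Clusters of mutations show up as runs of low values, which is the pattern described for panels (c) and (d).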
Figure 6. Enrichment or depletion of genomic and alignment features in FP calls for each medulloblastoma SSM submission.
For each feature, the difference in frequency with respect to the Gold Set is multiplied by the FP rate. Blue indicates values less than zero and thus the proportion of variants or their score on that feature is lower in the FP set with respect to the true variants. Reddish colours correspond to a higher proportion of variants or higher scores for the feature in FP calls versus the Gold Set. Both features and submissions are clustered hierarchically. The features shown here include same AF (the probability that the AF in the tumour sample is not higher than that in the normal samples, derived from the snape-cmp-counts score), DacBL (in ENCODE DAC mappability blacklist region), DukeBL (in ENCODE Duke mappability blacklist region), centr (in centromere or centromeric repeat), mult100 (1 − mappability of 100mers with 1% mismatch), map150 (1 − mappability of 150mers with 1% mismatch), DPNhi (high depth in normal), DPNlo (low depth in normal), dups (in high-identity segmental duplication), nestRep (in nested repeat), sRep (in simple repeat), inTR (in tandem repeat), adjTR (immediately adjacent to tandem repeat), msat (in microsatellite), hp (in or next to homopolymer of length >6), AFN (mutant AF in normal) and AFTlo (mutant AF in tumour <10%).
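The per-feature value described in the first sentence of the caption can be written as (frequency in FP calls − frequency in Gold Set) × FP rate. The sketch below takes the FP rate as an input, since its exact definition is not given here; the function name and the dictionary layout are assumptions for this example.

```python
def feature_enrichment(fp_feature_freq, gold_feature_freq, fp_rate):
    """Score each feature for one submission.

    fp_feature_freq:   feature name -> frequency (or mean score) among FP calls
    gold_feature_freq: feature name -> frequency (or mean score) in the Gold Set
    fp_rate:           the submission's FP rate, supplied by the caller

    Positive values mean the feature is enriched among FP calls,
    negative values mean it is depleted.
    """
    return {
        feature: (fp_feature_freq[feature] - gold_feature_freq[feature]) * fp_rate
        for feature in fp_feature_freq
    }
```

Weighting by the FP rate keeps submissions with few false positives from dominating the picture because of noisy frequency differences.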
Figure 7. Accuracy of re-filtered pipeline SSM calls.
Unfiltered calls (MB.F0 and CLL.F0) are shown as red squares, while the calls using the tuned filters (MB.F2 and CLL.F2) are shown as red circles for the medulloblastoma (a) and CLL (b) benchmark GOLD sets. For MB, only the recall versus Tier 3 is shown. Overall, 1,019 (81.2%) of the medulloblastoma SSMs (indicated by the dotted line) are considered callable at 40 × coverage; 236 MB SSMs (18.8%) were not called by any pipeline. For CLL, verification was carried out on SSMs originally called on the 40 × data, which explains the higher recall.
