Nucleic Acids Res. 2015 Dec 2;43(21):e143.
doi: 10.1093/nar/gkv717. Epub 2015 Jul 17.

Sources of PCR-induced distortions in high-throughput sequencing data sets

Justus M Kebschull et al. Nucleic Acids Res. 2015.

Abstract

PCR permits the exponential and sequence-specific amplification of DNA, even from minute starting quantities. PCR is a fundamental step in preparing DNA samples for high-throughput sequencing. However, there are errors associated with PCR-mediated amplification. Here we examine the effects of four important sources of error (bias, stochasticity, template switches and polymerase errors) on sequence representation in low-input next-generation sequencing libraries. We designed a pool of diverse PCR amplicons with a defined structure, and then used Illumina sequencing to search for signatures of each process. We further developed quantitative models for each process, and compared predictions of these models to our experimental data. We find that PCR stochasticity is the major force skewing sequence representation after amplification of a pool of unique DNA amplicons. Polymerase errors become very common in later cycles of PCR but have little impact on the overall sequence distribution as they are confined to small copy numbers. PCR template switches are rare and confined to low copy numbers. Our results provide a theoretical basis for removing distortions from high-throughput sequencing data. In addition, our findings on PCR stochasticity will have particular relevance to quantification of results from single-cell sequencing, in which sequences are represented by only one or a few molecules.

Figures

Figure 1.
Errors and biases in PCR and their theoretical impact on sequence representation. (A) Left: structure of the amplicons used in this study. Two 20 nt barcodes flank a constant sequence (AttL), forming a barcode pair. They are in turn flanked by Illumina P5 and P7 sites as well as sequencing primers (SolI and SolII). Right: sequence rank plot of experimental data set ZL037. A plateau, a broad shoulder and a long tail (partially visible) are apparent. (B-F) Schematic representations of perfect PCR (B) and of different modes of skewed PCR, together with their expected impact on sequencing data. PCR bias (C) and PCR stochasticity (D) skew the relative abundance of input sequences, but do not add any new sequences to the data set. In contrast, PCR template switching (E) and polymerase errors (F) generate novel sequences.
Figure 2.
PCR can grossly affect sequence representation in Illumina library generation. Sequence rank plot of replicate data sets ZL037 and ZL052 before (A) and after (B) linear scaling in x and y to compensate for different input amounts and sequencing depth. Scale factors for the x and y dimensions can be found in Table 3.
Figure 3.
GC bias. (A) Schematic of two cycles of GC-biased PCR. A sequence with balanced sequence composition (red) is readily amplified, whereas a GC-rich sequence (green) does not amplify well. (B) Cumulative distribution of GC content ± SD for 1500 sequences in plateau, shoulder and tail of the sequence trace of ZL037 and ZL053. No striking differences can be observed. However, the mean GC contents of the three distributions are statistically significantly different. (C) Relative PCR efficiencies as a function of GC content as measured in the plateau of all three data sets, normalized to an efficiency of 1.9 for GC contents of 0.5 to 0.55. Linear fits are plotted as lines. PCR efficiencies are roughly constant across the observed range of GC contents, including the high-GC barcode pairs of ZL053 (green). (D) Simulation of PCR using PCR efficiencies as derived in (C), compared to ZL037 and ZL052. The simulation fails to capture the shape of the data, confirming that GC bias is insufficient to explain the observed sequence distribution.
Figure 4.
PCR stochasticity. (A) Schematic of two cycles of stochastic PCR amplification. A lucky barcode pair (red) gets amplified at every cycle, whereas an unlucky barcode pair (blue) fails to get amplified at all. A barcode pair with mediocre luck is depicted in purple. (B) The exact probability distribution of sequence copy numbers after 15 cycles of PCR with Pamp = 0.9. Arrows indicate two local maxima in the PDF at roughly one half and one quarter of the molecule number at the global maximum. (C) Probability distribution of sequence copy number after 1 to 7 cycles of PCR (blue fading through orange to red). The birth and evolution of the two local maxima observed in (B) are visible. The probability distributions after j = 1..7 cycles were normalized to sum to 2^j to aid visualization. (D) A sample of 2900 sequences from the approximate probability distribution after 25 cycles of PCR with Pamp = 0.9 (red) correlates closely with the 2900 most abundant sequence reads of the experimental data. Simulations for Pamp = 0.8, Pamp = 0.85, Pamp = 0.95 and Pamp = 0.99 are plotted as dashed lines.
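The copy-number distribution described in panels (B) and (C) can be illustrated with a minimal sketch. It assumes, as a simplification not spelled out in the caption, that every molecule is copied independently with probability Pamp in each cycle, starting from a single template. The function name `copy_number_pmf` is hypothetical, and the example uses 7 cycles rather than 15 to keep the pure-Python recursion fast.

```python
from math import comb

def copy_number_pmf(cycles, p_amp):
    """Exact PMF of the copy number after `cycles` PCR cycles.

    Assumption: one starting molecule; in each cycle every molecule is
    copied independently with probability p_amp (a simple branching process).
    """
    pmf = {1: 1.0}  # before cycling: exactly one molecule
    for _ in range(cycles):
        nxt = {}
        for n, prob in pmf.items():
            # k of the n molecules are copied: k ~ Binomial(n, p_amp)
            for k in range(n + 1):
                w = comb(n, k) * p_amp**k * (1 - p_amp) ** (n - k)
                nxt[n + k] = nxt.get(n + k, 0.0) + prob * w
        pmf = nxt
    return pmf

pmf = copy_number_pmf(cycles=7, p_amp=0.9)
mean = sum(n * q for n, q in pmf.items())
mode = max(pmf, key=pmf.get)
print(f"mean copy number: {mean:.2f}, most likely copy number: {mode}")
```

Under this assumption the mean grows as (1 + Pamp)^j, while the spread around it is set mostly by what happens in the first few cycles, when a single failed copy removes a large fraction of the final yield.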
Figure 5.
Template switching. (A) Schematic of one cycle of PCR with template switching. During amplification of the blue barcode pair, the polymerase switches to the red barcode pair in the constant region, producing a blue-red chimera. (B) The barcode libraries contain two classes of barcode pairs (BC1-BC1 and BC2-BC2), that are distinguishable by purine and pyrimidine anchors (top). If a BC1-BC2 or BC2-BC1 barcode pair is detected, it must have been formed by a template switch. Such inter-class switches should make up half of all template switches. (C) Abundance of detected template switched sequences ± SD in sequence rank space. Template switches are rare in abundant sequences, but become more frequent as copy numbers reach one. (D) A simulation of template switching on a background of perfect PCR (red) captures little of the empirical sequence distribution. The only free parameter in our model of template switching, the per molecule rate of template switching s0, was independently estimated from the data.
Figure 6.
Polymerase errors. (A) Schematic of two cycles of PCR with polymerase errors. Polymerase errors introduce mutations into an input barcode pair (red), effectively producing novel sequences (orange, lavender, yellow). (B) Histogram of the minimum Hamming distance ± SD from sequences in the plateau to other plateau sequences (blue) and sequences from shoulder and tail (scaled rank 2900 to 10000) to plateau sequences (green). In contrast to plateau sequences, the majority of sequences from shoulder and tail are within a Hamming distance of one (i.e. one base change) from the parent plateau sequences. (C) Position of errors detected using mismatches to anchor sequences in the barcodes in sequence rank space ± SD. While the plateau is depleted of polymerase errors, shoulder and tail sequences show a large increase in error frequency. (D) A simulation of polymerase errors on a background of perfect PCR (red) recapitulates the shoulder to tail transition of the observed sequence distribution. The polymerase error rate used for the simulation was independently estimated from the data.
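The minimum Hamming distance used in panel (B) can be sketched as follows. This is an illustrative implementation, not the authors' analysis code; the function names and the toy 8 nt sequences are hypothetical stand-ins for real barcode pairs.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_hamming(query, references):
    """Smallest Hamming distance from `query` to any sequence in `references`."""
    return min(hamming(query, ref) for ref in references)

# Toy data (hypothetical): abundant "plateau" sequences and one rare sequence.
plateau = ["ACGTACGT", "TTGCAAGC", "GGATCCAA"]
rare = "ACGTACGA"  # differs from the first plateau sequence at one position

# Rare sequence versus all plateau sequences.
print(min_hamming(rare, plateau))
# Plateau sequence versus the *other* plateau sequences (exclude itself).
print(min_hamming(plateau[0], plateau[1:]))
```

A rare sequence at distance one from an abundant sequence is a candidate polymerase-error product, whereas unrelated input barcodes are expected to lie many mismatches apart.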
Figure 7.
Polymerase errors and stochasticity appear to explain a large fraction of the observed data. PCR is simulated as a Galton-Watson process with polymerase errors added at the average experimental rate. Simulated (red) and observed sequence profiles (light and dark blue) match closely.
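A simulation of this kind might look like the sketch below. It is a toy version under stated assumptions, not the authors' model: each molecule is copied with probability `p_amp` per cycle, each new copy carries an error with probability `p_error`, and every erroneous copy is treated as a brand-new unique sequence that starts at copy number one. The function name and all parameter values are hypothetical.

```python
import random

def simulate_pcr(n_templates, cycles, p_amp, p_error, seed=0):
    """Return per-sequence copy numbers, sorted in descending order."""
    rng = random.Random(seed)
    counts = [1] * n_templates  # one molecule per input sequence
    for _ in range(cycles):
        new_sequences = 0
        for i in range(len(counts)):
            # Each existing molecule is copied with probability p_amp.
            copies = sum(rng.random() < p_amp for _ in range(counts[i]))
            # Each copy is erroneous with probability p_error (assumption:
            # every error yields a distinct novel sequence).
            errors = sum(rng.random() < p_error for _ in range(copies))
            counts[i] += copies - errors
            new_sequences += errors
        # Novel sequences enter the pool and amplify from the next cycle on.
        counts.extend([1] * new_sequences)
    return sorted(counts, reverse=True)

profile = simulate_pcr(n_templates=50, cycles=10, p_amp=0.9, p_error=0.01)
print(len(profile), "distinct sequences; top copy numbers:", profile[:5])
```

Sorting the copy numbers in descending order gives a simulated sequence rank profile: the input sequences occupy the high-copy ranks, and the error-derived sequences populate the low-copy ranks.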
