J Virol. 2015 Aug;89(16):8540-55.
doi: 10.1128/JVI.00522-15. Epub 2015 Jun 3.

Primer ID Validates Template Sampling Depth and Greatly Reduces the Error Rate of Next-Generation Sequencing of HIV-1 Genomic RNA Populations

Shuntai Zhou et al. J Virol. 2015 Aug.

Abstract

Validating the sampling depth and reducing sequencing errors are critical for studies of viral populations using next-generation sequencing (NGS). We previously described the use of Primer ID to tag each viral RNA template with a block of degenerate nucleotides in the cDNA primer. We now show that low-abundance Primer IDs (offspring Primer IDs) are generated due to PCR/sequencing errors. These artifactual Primer IDs can be removed using a cutoff model for the number of reads required to make a template consensus sequence. We have modeled the fraction of sequences lost due to Primer ID resampling. For a typical sequencing run, less than 10% of the raw reads are lost to offspring Primer ID filtering and resampling. The remaining raw reads are used to correct for PCR resampling and sequencing errors. We also demonstrate that Primer ID reveals bias intrinsic to PCR, especially at low template input or utilization. cDNA synthesis and PCR convert ca. 20% of RNA templates into recoverable sequences, and 30-fold sequence coverage recovers most of these template sequences. We have directly measured the residual error rate to be around 1 in 10,000 nucleotides. We use this error rate and the Poisson distribution to define the cutoff to identify preexisting drug resistance mutations at low abundance in an HIV-infected subject. Collectively, these studies show that >90% of the raw sequence reads can be used to validate template sampling depth and to dramatically reduce the error rate in assessing a genetically diverse viral population using NGS.
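
The Poisson-based cutoff mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes that residual errors at a given site are independent and occur at the reported rate of about 1 in 10,000, so that the number of erroneous template consensus sequences at that site is roughly Poisson distributed. The function names and the significance level `alpha` are hypothetical choices made for illustration.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    # Sum P(X = i) for i = 0..k-1, then take the complement.
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def min_variant_count(n_consensus, error_rate=1e-4, alpha=0.05):
    """Smallest count at which a variant is unlikely to be residual error.

    n_consensus: number of template consensus sequences covering the site.
    error_rate:  residual per-nucleotide error rate (about 1e-4 in the abstract).
    alpha:       assumed significance level (hypothetical, not from the source).
    """
    lam = n_consensus * error_rate  # expected number of erroneous sequences
    k = 1
    while poisson_sf(k, lam) >= alpha:
        k += 1
    return k

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, min_variant_count(n))
```

Under these assumptions, a variant seen in 1,000 template consensus sequences would need to appear at least twice, and in 10,000 sequences at least four times, before it is distinguishable from residual error at the chosen `alpha`.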

Importance: Although next-generation sequencing (NGS) has revolutionized sequencing strategies, it suffers from serious limitations in defining sequence heterogeneity in a genetically diverse population, such as HIV-1, due to PCR resampling and PCR/sequencing errors. The Primer ID approach reveals the true sampling depth and greatly reduces errors. Knowing the sampling depth allows the construction of a model of how to maximize the recovery of sequences from input templates and to reduce resampling of the Primer ID so that appropriate multiplexing can be included in the experimental design. With the defined sampling depth and measured error rate, we are able to assign cutoffs for the accurate detection of minority variants in viral populations. This approach allows the power of NGS to be realized without having to guess about sampling depth or to ignore the problem of PCR resampling, while also being able to correct most of the errors in the data set.


Figures

FIG 1
Example of Primer ID distribution. Most Primer IDs appear at very low frequency (once or twice), while some appear several hundred times in the raw read output. Mutations within the Primer ID (offspring Primer IDs), PCR amplification skewing, and Primer ID resampling are suggested as factors that shape the observed distribution of reads per Primer ID.
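
A distribution of this kind can be tabulated from raw reads with a few lines of code. The sketch below assumes the Primer ID has already been extracted from each read as a string; the function names and the toy input are hypothetical.

```python
from collections import Counter

def reads_per_primer_id(primer_ids):
    """Count how many raw reads carry each Primer ID."""
    return Counter(primer_ids)

def abundance_histogram(primer_ids):
    """Map 'reads per Primer ID' to the number of Primer IDs seen that often."""
    per_id = reads_per_primer_id(primer_ids)
    return dict(sorted(Counter(per_id.values()).items()))

# Toy input: hypothetical 8-nt Primer IDs, one entry per raw read.
toy = ["ACGTACGT"] * 5 + ["ACGTACGA"] + ["TTGCAAGC"] * 3 + ["GGGGCCCC"]

if __name__ == "__main__":
    print(abundance_histogram(toy))  # two singles, one ID seen 3x, one seen 5x
```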
FIG 2
Adaptation of the Primer ID approach to the MiSeq platform. MiSeq library construction with the Primer ID approach from viral RNA template was used for sequencing. The Primer ID (yellow, N8) is included in the cDNA primer, along with a PCR primer site (brown), and the upstream primer includes four randomized bases to add diversity to the initial sequence read (orange, N4). Illumina indexed primers (green with purple barcode) are included in the last round of PCR. The paired-end sequences of region 1 (R1) and region 2 (R2), which may or may not overlap in the middle, are indicated.
FIG 3
Assessment of offspring Primer IDs. (a) The Primer ID sequences for “singles” have significantly lower quality scores than the Primer ID sequences at the highest frequency. Primer IDs were 8 nucleotides long. (b) Primer ID distribution and percentages of Primer IDs at low abundance (i.e., read fewer than 23 times) with one or two nucleotide differences from an abundant consensus Primer ID. Data were generated from the dilution experiment sample RSD11. This example was chosen to highlight the issue of offspring Primer IDs, which is exacerbated when low-input template copies are used. In this case, the total number of consensus sequences above the cutoff was only 121, which is why there is not a symmetrical distribution of raw reads per Primer ID. Symbols for one (red squares) and two (green triangles) nucleotide differences are read on the percentage scale, while the symbol for number of Primer IDs (blue diamonds) is read on the log scale.
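
The comparison in panel (b) amounts to asking which low-abundance Primer IDs lie within one or two substitutions of an abundant one. A minimal sketch of that check follows, assuming Primer IDs of equal length and a read-count cutoff supplied by the caller (23 in the example above); the function names are hypothetical and this is not the authors' code.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Primer IDs must have equal length")
    return sum(x != y for x, y in zip(a, b))

def flag_offspring(counts, cutoff, max_dist=2):
    """Return low-abundance Primer IDs within max_dist of an abundant one.

    counts: mapping of Primer ID -> raw read count.
    cutoff: read count at or above which a Primer ID is treated as abundant.
    Result maps each flagged Primer ID to its distance from the nearest
    abundant Primer ID.
    """
    abundant = [pid for pid, n in counts.items() if n >= cutoff]
    flagged = {}
    for pid, n in counts.items():
        if n >= cutoff:
            continue  # abundant Primer IDs are candidate parents, not offspring
        dists = [hamming(pid, parent) for parent in abundant]
        if dists and min(dists) <= max_dist:
            flagged[pid] = min(dists)
    return flagged

if __name__ == "__main__":
    counts = {"ACGTACGT": 120, "ACGTACGA": 2, "ACGTACCA": 1, "TTTTTTTT": 1}
    print(flag_offspring(counts, cutoff=23))
```

The pairwise comparison is quadratic in the number of Primer IDs, which is acceptable for a sketch but would need indexing for large libraries.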
FIG 4
Simulated correlation of the abundance of observed parental Primer IDs and the maximum abundance of the offspring Primer ID. Open squares indicate the mean number of maximum abundances of offspring Primer IDs given the observed number of parental Primer IDs. Open circles indicate the upper limit of the 95% confidence intervals of the maximum abundances of offspring Primer IDs, which serve as the Primer ID read number cutoffs for the given abundances of observed maximum parental Primer IDs in a sequencing library. (a) Observed parental Primer IDs from 0 to 20,000. (b) Observed parental Primer IDs from 0 to 2,000.
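
A simulation in the same spirit can be sketched as follows. It is an illustration under stated assumptions, not a reproduction of the authors' model: each read of a parental Primer ID is copied with an independent per-base substitution probability (`sub_rate`, a hypothetical parameter), and the cutoff is taken as the empirical 95th percentile of the simulated maximum offspring abundance rather than a confidence-interval bound. All names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def mutate(primer_id, sub_rate, rng):
    """Copy a Primer ID, substituting each base with probability sub_rate."""
    out = []
    for base in primer_id:
        if rng.random() < sub_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def max_offspring_abundance(parent_reads, sub_rate=0.01, parent="ACGTACGT", seed=0):
    """Simulate reads of one parental Primer ID; return the largest offspring count."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    reads = Counter(mutate(parent, sub_rate, rng) for _ in range(parent_reads))
    offspring = [n for pid, n in reads.items() if pid != parent]
    return max(offspring, default=0)

def cutoff_from_simulation(parent_reads, trials=200, sub_rate=0.01):
    """Empirical 95th percentile of the maximum offspring abundance."""
    maxima = sorted(
        max_offspring_abundance(parent_reads, sub_rate, seed=s) for s in range(trials)
    )
    return maxima[int(0.95 * (trials - 1))]

if __name__ == "__main__":
    for n in (100, 500, 2000):
        print(n, cutoff_from_simulation(n, trials=50))
```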
FIG 5
Correlation of the number of total Primer IDs, the number of Primer IDs that appear more than twice, and the number of template consensus sequences using the Primer ID read number cutoff model as a function of the number of input templates. Primer ID was 8 nucleotides long. The data are plotted from the experiment shown in the table below the graph, and the percentage of the sequences discarded using the Primer ID read number cutoff model is shown.
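
The step from raw reads to template consensus sequences can be sketched as below: reads are grouped by Primer ID, groups at or below a read-number cutoff are discarded, and a column-wise majority vote is taken over the rest. The sketch assumes the reads are already aligned to equal length and takes the cutoff as an input rather than deriving it from the cutoff model; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def majority_consensus(seqs):
    """Column-wise majority vote over equal-length sequences.

    Ties are broken by the order in which bases are first seen in a column.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def template_consensus(reads, cutoff):
    """Build one consensus per Primer ID whose read count exceeds cutoff.

    reads: iterable of (primer_id, sequence) pairs.
    """
    groups = defaultdict(list)
    for pid, seq in reads:
        groups[pid].append(seq)
    return {
        pid: majority_consensus(seqs)
        for pid, seqs in groups.items()
        if len(seqs) > cutoff
    }

if __name__ == "__main__":
    reads = [
        ("AAAAAAAA", "ACGT"),
        ("AAAAAAAA", "ACGT"),
        ("AAAAAAAA", "ACTT"),  # one read carries a likely sequencing error
        ("CCCCCCCC", "GGGG"),  # single read, dropped by the cutoff
    ]
    print(template_consensus(reads, cutoff=2))
```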
FIG 6
Comparison of Primer ID distribution in two replications of library construction and sequencing of the same template. Primer IDs in the top 10% by read abundance in run 1 (red) and in the bottom 90% (blue) that also appeared in run 2 were analyzed for their distribution in run 2.
FIG 7
Primer ID distribution as observed and compared to three models. Blue diamonds correspond to the Primer ID distribution from a plasma sample. We modeled Primer ID distributions under three different sets of assumptions. In model 1 (red squares), we assumed that there were no sequencing errors within the 8-nucleotide Primer ID sequence block, and all templates were included in the PCR with 100% efficiency. In model 2 (green triangles), we included 1% PCR/sequencing substitutions at the Primer ID region. In model 3 (purple circles), we assumed that only half of the templates were used in each of the first 10 cycles of PCR before sequencing, in addition to a 1% substitution rate in the Primer ID sequence block.
FIG 8
Patterns of Primer ID resampling and template coverage. (a) Relationship between the number of raw sequences and Primer ID resampling (i.e., the percentage of template consensus sequences from more than one template in all of the template consensus sequences recovered) at different levels of converted templates. (b) Relationship between the number of raw sequences and template recovery at different levels of converted templates.
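
The resampling quantity in panel (a) can be approximated analytically under simplifying assumptions. The sketch below assumes that each converted template receives one of the 4^8 = 65,536 possible 8-nucleotide Primer IDs uniformly at random and that every tagged template is recovered; it ignores the dependence on raw read number shown in the figure. The function name is hypothetical.

```python
def resampling_fraction(n_templates, primer_id_len=8):
    """Expected share of observed Primer IDs that tag more than one template.

    Assumes uniform random Primer ID assignment and full recovery of every
    tagged template (both are simplifications, not claims from the source).
    """
    if n_templates < 1:
        raise ValueError("n_templates must be at least 1")
    m = 4 ** primer_id_len                # number of possible Primer IDs
    p_miss = 1.0 - 1.0 / m                # chance one template misses a given ID
    distinct = m * (1.0 - p_miss ** n_templates)        # expected IDs in use
    singles = n_templates * p_miss ** (n_templates - 1) # expected IDs used once
    return 1.0 - singles / distinct

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(resampling_fraction(n), 4))
```

Under these assumptions the fraction grows roughly as n / (2 × 4^8) for small n, which is why resampling stays below 1% for about 1,000 converted templates but becomes appreciable as the template number approaches the size of the Primer ID pool.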
