Virus Evol. 2019 Jan 30;5(1):vey041.
doi: 10.1093/ve/vey041. eCollection 2019 Jan.

Measurements of intrahost viral diversity require an unbiased diversity metric

Lei Zhao et al. Virus Evol. 2019.

Abstract

Viruses exist within hosts at large population sizes and are subject to high rates of mutation. As such, viral populations exhibit considerable sequence diversity. A variety of summary statistics have been developed which describe, in a single number, the extent of diversity in a viral population; such measurements allow the diversities of different populations to be compared, and the effect of evolutionary forces on a population to be assessed. Here we highlight statistical artefacts underlying some common measures of sequence diversity, whereby variation in the depth of genome sequencing may substantially affect the extent of diversity measured in a viral population, making comparisons of population diversity invalid. Specifically, naive estimation of sequence entropy provides a systematically biased metric, a lower read depth being expected to produce a lower estimate of diversity. The number of polymorphic loci per kilobase of genome is more unpredictably affected by read depth, giving potentially flawed results at lower sequencing depths. We show that the nucleotide diversity statistic π provides an unbiased estimate of diversity in the sense that the expected value of the statistic is equal to the correct value of the property being measured. Our results are of importance for studies interpreting genome sequence data; we describe how diversity may be assessed in viral populations in a fair and unbiased manner.

Keywords: entropy; polymorphism; sequence data; virus diversity.
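
The contrast between the two statistics can be checked numerically. The sketch below is not the authors' code; it assumes a single biallelic locus with true minor variant frequency `q`, binomial sampling of `n` reads, entropy measured in nats, and the per-site pairwise form of π. The function names are illustrative only. It computes the exact expected value of each statistic over all possible read counts.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, q):
    """Binomial probability of k variant reads out of n, computed in log space."""
    log_p = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(q) + (n - k) * log(1 - q))
    return exp(log_p)


def entropy(p):
    """Entropy (nats) of a biallelic locus with variant frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log(p) + (1 - p) * log(1 - p))


def naive_entropy(k, n):
    """Plug-in entropy estimate: entropy of the observed frequency k/n."""
    return entropy(k / n)


def pi_site(k, n):
    """Per-site pi: fraction of read pairs that differ at the locus."""
    return 2 * k * (n - k) / (n * (n - 1))


def expected(stat, n, q):
    """Exact expectation of stat(k, n) when k ~ Binomial(n, q)."""
    return sum(binom_pmf(k, n, q) * stat(k, n) for k in range(n + 1))


if __name__ == "__main__":
    q = 0.3  # assumed true variant frequency
    print("true entropy:", entropy(q), " true pi:", 2 * q * (1 - q))
    for n in (10, 100, 1000):
        print(n, expected(naive_entropy, n, q), expected(pi_site, n, q))
```

Under these assumptions the expected plug-in entropy sits below the true entropy and approaches it as `n` grows, whereas the expected value of `pi_site` equals 2q(1 - q) at every depth of at least two reads.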

Figures

Figure 1.
Mean sequence entropy values calculated for sets of 1,000 loci each of which has a consistent minor variant frequency. Means of these values calculated across 100 replicates are shown as black dots, with vertical bars, where visible, showing an interval of ±2 standard deviations. The correct entropy is shown by a dashed red line. The dashed blue lines, where not obscured by the correct entropy value, show the upper and lower limits described in Equation (3), with the upper limit showing the correct sequence entropy value. Data are shown for (A) a variant frequency of 30% and (B) a variant frequency of 0.03%.
Figure 2.
(A) Trend in the probability of a variant being identified as a polymorphism at 1% frequency as a function of read depth. At very high read depth, variants with a frequency greater than 1% will always be identified as polymorphisms, while variants below this frequency will never be identified as polymorphisms. Details of the function in the region between the vertical grey dashed lines are shown in (B). (B) Detailed probability values. The range of frequencies at which a variant can be identified is constrained to the set of values i/N, where N is the read depth and i is an integer read count; this constraint leads to a sawtooth pattern in the probability of identifying a polymorphism.
Figure 3.
Allele frequency spectra for the two datasets analysed in this study. The within-host influenza dataset shows a small number of polymorphic sites relative to the HIV data.
Figure 4.
Diversity statistics calculated for HIV (black) and influenza (red) sequence data following downsampling of the data to lower read depths. Ten replicate downsampling calculations were performed for each point; dots show mean values, with vertical bars, where visible, showing an interval of ±2 standard deviations. Dashed grey lines show the values calculated from the complete dataset.
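
The sawtooth pattern described in the Figure 2 caption can be reproduced with a short calculation. This is a minimal sketch under stated assumptions, not the procedure used in the paper: reads are sampled binomially, a variant is called polymorphic when its observed frequency i/N is at least the threshold, and the function and parameter names are hypothetical. The threshold is passed as an integer fraction so that the minimum read count is computed without floating-point rounding.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, q):
    """Binomial probability of k variant reads out of n, computed in log space."""
    log_p = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
             + k * log(q) + (n - k) * log(1 - q))
    return exp(log_p)


def min_reads(n, thr_num=1, thr_den=100):
    """Smallest count i with i/n >= thr_num/thr_den (integer ceiling)."""
    return (n * thr_num + thr_den - 1) // thr_den


def prob_called(n, q, thr_num=1, thr_den=100):
    """Probability that a variant of true frequency q is observed at or
    above the threshold frequency at read depth n (assumed call rule)."""
    k_min = min_reads(n, thr_num, thr_den)
    below = sum(binom_pmf(k, n, q) for k in range(k_min))
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    q = 0.005  # assumed true frequency, below the 1% threshold
    for n in (100, 101, 150, 200, 201):
        print(n, min_reads(n), prob_called(n, q))
```

With these assumptions the probability rises with depth while the required read count stays fixed, then drops each time the depth crosses a multiple of 100 and one more variant read is needed, which gives the sawtooth shape.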
