How to optimize the precision of allele and haplotype frequency estimates using pooled-sequencing data

Mol Ecol Resour. 2018 Mar;18(2):194-203. doi: 10.1111/1755-0998.12723. Epub 2017 Nov 4.

Abstract

Sequencing pools of individuals rather than individuals separately reduces the costs of estimating allele frequencies at many loci in many populations. Theoretical and empirical studies show that sequencing pools comprising a limited number of individuals (typically fewer than 50) provides reliable allele frequency estimates, provided that the DNA pooling and DNA sequencing steps are carefully controlled. Unequal contributions of different individuals to the DNA pool and the mean and variance in sequencing depth can both affect the standard error of allele frequency estimates. To our knowledge, no study has separately investigated the effects of these two factors on allele frequency estimates, so there is currently no method to estimate a priori the relative importance of unequal individual DNA contributions independently of sequencing depth. We develop a new analytical model for allele frequency estimation that explicitly distinguishes these two effects. Our model shows that the DNA pooling variance in a pooled sequencing experiment depends solely on two factors: the number of individuals within the pool and the coefficient of variation of individual DNA contributions to the pool. We present a new method to experimentally estimate this coefficient of variation when planning a pooled sequencing design where samples are pooled either before or after DNA extraction. Using this analytical and experimental framework, we provide guidelines to optimize the design of pooled sequencing experiments. Finally, we sequence replicated pools of inbred lines of the plant Medicago truncatula and show that the predictions from our model generally hold true when estimating the frequency of known multilocus haplotypes using pooled sequencing.
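
The distinction between pooling variance and sequencing variance can be illustrated with a small Monte Carlo sketch. This is not the analytical model of the paper, only a simulation under stated assumptions: individual DNA contributions are drawn from a gamma distribution with mean 1 and a chosen coefficient of variation (an assumption made here for convenience), individuals are inbred lines or haploids carrying the allele at frequency 0 or 1, and sequencing is modelled as binomial sampling of reads at a fixed depth. All function names (`pool_frequency`, `sequenced_frequency`, `simulate_sd`) are hypothetical.

```python
import random
import statistics


def pool_frequency(genotypes, cv, rng):
    """Allele frequency in the DNA pool under unequal individual contributions.

    genotypes: per-individual allele frequency (0 or 1 for inbred lines).
    cv: coefficient of variation of individual DNA contributions.
    Assumption: contributions follow a gamma distribution with mean 1.
    """
    if cv == 0:
        contributions = [1.0] * len(genotypes)
    else:
        shape = 1.0 / cv ** 2  # a gamma with this shape has CV equal to cv
        scale = cv ** 2        # shape * scale = 1, so the mean is 1
        contributions = [rng.gammavariate(shape, scale) for _ in genotypes]
    total = sum(contributions)
    return sum(c * g for c, g in zip(contributions, genotypes)) / total


def sequenced_frequency(pool_freq, depth, rng):
    """Binomial sampling of `depth` reads from a pool with frequency pool_freq."""
    reads = sum(rng.random() < pool_freq for _ in range(depth))
    return reads / depth


def simulate_sd(genotypes, cv, depth=None, replicates=2000, seed=1):
    """Standard deviation of the frequency estimate across replicate pools.

    With depth=None only the pooling step is simulated, so the result
    isolates the pooling component of the error.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        freq = pool_frequency(genotypes, cv, rng)
        if depth is not None:
            freq = sequenced_frequency(freq, depth, rng)
        estimates.append(freq)
    return statistics.pstdev(estimates)


if __name__ == "__main__":
    small_pool = [1] * 10 + [0] * 10   # 20 individuals, true frequency 0.5
    large_pool = [1] * 50 + [0] * 50   # 100 individuals, true frequency 0.5
    for label, pool in (("n=20", small_pool), ("n=100", large_pool)):
        for cv in (0.0, 0.25, 0.5):
            sd = simulate_sd(pool, cv)
            print(label, "CV =", cv, "pooling SD =", round(sd, 4))
```

Calling `simulate_sd` with `depth=None` isolates the pooling step, while passing a depth adds the sequencing step on top. In this sketch the pooling component grows with the coefficient of variation and shrinks as the number of pooled individuals increases, which is the qualitative dependence described in the abstract.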

Keywords: allele frequency estimation; coverage depth; experimental evolution; fitness; haplotype frequency estimation; population genomics.

MeSH terms

  • Computational Biology / methods*
  • Gene Frequency*
  • Genetics, Population / methods*
  • Haplotypes*
  • Medicago truncatula / classification
  • Medicago truncatula / genetics
  • Sequence Analysis, DNA / methods*