PLoS One, 7 (5), e37135

Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species

Brant K Peterson et al. PLoS One.

Abstract

The ability to efficiently and accurately determine genotypes is a keystone technology in modern genetics, crucial to studies ranging from clinical diagnostics, to genotype-phenotype association, to reconstruction of ancestry and the detection of selection. To date, high-capacity, low-cost genotyping has been largely achieved via "SNP chip" microarray-based platforms, which require substantial prior knowledge of both genome sequence and variability, and once designed are suitable only for those targeted variable nucleotide sites. This method introduces substantial ascertainment bias and inherently precludes detection of rare or population-specific variants, a major source of information for both population history and genotype-phenotype association. Recent developments in reduced-representation genome sequencing experiments on massively parallel sequencers (commonly referred to as RAD-tag or RADseq) have brought direct sequencing to the problem of population genotyping, but increased cost and procedural and analytical complexity have limited their widespread adoption. Here, we describe a complete laboratory protocol, including a custom combinatorial indexing method, and accompanying software tools to facilitate genotyping across large numbers (hundreds or more) of individuals for a range of markers (hundreds to hundreds of thousands). Our method requires no prior genomic knowledge and achieves per-site and per-individual costs below those of current SNP chip technology, while requiring similar hands-on time investment, comparable amounts of input DNA, and downstream analysis times on the order of hours. Finally, we provide empirical results from the application of this method to genotyping both in a laboratory cross and in wild populations. Because of its flexibility, this modified RADseq approach promises to be applicable to a diversity of biological questions in a wide range of organisms.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. A flexible genotyping method can be used to optimize the number of genetic markers for a specific experimental approach in a given biological system.
Segregating genetic markers are used to make inferences about historical processes (e.g., phylogenetic relationships, population structure) and functional mechanisms (e.g., genotype-phenotype mapping), but the optimal number of markers (fraction of the genome) needed to achieve a desired level of resolution differs based on both the experimental approach and the specific biological system. The number of genetic markers needed to recover relationships among populations or species is related to divergence among groups (e.g., more recent or more rapid events require more variable loci); the number of markers required for optimal resolution in phenotype-mapping experiments (conducted in laboratory crosses or pedigreed wild populations) is a function of the number of recombination events captured in the pedigree; and the number of markers used in association mapping or selection scans in wild populations is determined by genome-wide levels of linkage disequilibrium, which is largely dictated by demographic history. Recent methods combining reduced representation library construction and next-generation sequencing (i.e., RADseq [6]) target an intermediate number of regions (shown schematically above). We expand on this approach to provide marker sets ranging from 100s to 100,000s of regions at low cost with no requirement of prior genomic data (ddRADseq; double digest RAD sequencing).
Figure 2. Double digest RAD sequencing improves efficiency and robustness while minimizing cost.
(A) Traditional Restriction-Site Associated DNA sequencing (RADseq) uses a single restriction enzyme (RE) digest coupled with secondary random fragmentation and broad size selection to generate reduced representation libraries consisting of all genomic regions adjacent to the RE cut site (red segments). (B) Double digest RAD sequencing (ddRADseq), by contrast, uses a two-enzyme double digest followed by precise size selection that excludes regions flanked by either [a] very close or [b] very distant RE recognition sites, recovering a library consisting of only fragments close to the target size (red segments). Representation in this library is expected to be inversely proportional to deviation from the size-selection target; thus, read counts across regions are expected to be correlated between individuals (yellow and green bars).
Figure 3. Double digest RAD sequencing provides flexibility in the number of homologous fragments recovered.
Changing the restriction enzyme (RE) or size-selection regime modifies the fraction of genome recovered. Simulation 1 (blue lines, shading): the expected fragment size distribution for a RE digest with NlaIII and MluCI (CATG and AATT) in the Mus musculus genome is shown (solid blue line). “Broad” size selection (300 bp±50 bp) is modeled by a normal sampling distribution (mean = 300 bp, SD = 25 bp). Under this sampling distribution, 4,900,000 sequence reads (dashed blue line) are expected to cover ∼119,000 regions at 7× or greater (blue area). Simulation 2 (red lines, shading): the expected fragment size distribution for a digest with EcoRI and MspI (GAATTC and CCGG) is shown (solid red line). “Narrow” size selection (300 bp±24 bp; see text) is modeled by a normal sampling distribution (mean = 300 bp, SD = 11 bp; see Analysis S1 Supporting Figure 1). Under this sampling distribution, an investment of 315,000 sequence reads (dashed red line) is sufficient to recover ∼17,000 regions at 7× or greater (red area).
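The kind of simulation described in this caption can be illustrated with a minimal Python sketch. It is not the authors' software: the function names (`double_digest`, `selection_weight`, `expected_reads`) and the toy sequence are hypothetical. It assumes that only fragments bounded by one cut from each enzyme are sequenced, that each cut falls at the first base of the recognition site, and that size selection acts as an unnormalized normal curve around the target length.

```python
import math
import re


def double_digest(seq, site_a, site_b):
    """Lengths of fragments bounded by one site_a cut and one site_b cut.

    Simplification: each cut is placed at the first base of the recognition
    site, so enzyme-specific cut offsets and overhangs are ignored.
    """
    cuts = []
    for label, site in (("A", site_a), ("B", site_b)):
        # Lookahead so that overlapping occurrences would also be found.
        pattern = "(?=" + re.escape(site) + ")"
        cuts += [(m.start(), label) for m in re.finditer(pattern, seq)]
    cuts.sort()
    # Keep only neighbouring cut pairs made by different enzymes.
    return [p2 - p1 for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]) if e1 != e2]


def selection_weight(length, mean=300.0, sd=25.0):
    """Relative chance of surviving size selection (unnormalized normal curve)."""
    return math.exp(-0.5 * ((length - mean) / sd) ** 2)


def expected_reads(lengths, total_reads, mean=300.0, sd=25.0):
    """Split total_reads across fragments in proportion to selection weight."""
    weights = [selection_weight(n, mean, sd) for n in lengths]
    total = sum(weights)
    if total == 0:
        return [0.0] * len(lengths)
    return [total_reads * w / total for w in weights]


# Toy sequence (hypothetical, not real genomic data).
toy = "CATG" + "C" * 10 + "AATT" + "C" * 6 + "AATT" + "C" * 5 + "CATG"
fragments = double_digest(toy, "CATG", "AATT")
print(fragments)
print(expected_reads(fragments, total_reads=100, mean=14, sd=5))
```

Swapping the recognition sequences (for example GAATTC and CCGG) or narrowing `sd` changes how many fragments receive appreciable read depth, which is the flexibility the figure illustrates.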
Figure 4. Recovery of genomic regions in deer mice (Peromyscus maniculatus and P. polionotus) is well predicted by simulation based on the laboratory mouse (Mus musculus) genome with precise size selection.
Simulated data based on the Mus musculus genome (dashed lines) and actual data from the distantly related rodents Peromyscus maniculatus and P. polionotus (solid lines), both fragmented at EcoRI and MspI recognition sites. Sampling from the Mus genome is drawn from a normal distribution (mean = 300 bp and SD = 11.5, 17.5, and 30 bp), which represents the best match for Peromyscus ddRADseq with size-selection windows of ±24 bp (“narrow”, green), ±36 bp (“wide”, blue) and ±25–50 bp (“gel”, red), respectively. The narrow and wide selection sets are based on a more precise automated size-selection method (PippinPrep, Sage Science). Recovery in ddRADseq experiments, both within and across individuals, is highly predictable: (A) Region coverage is highly correlated between simulated Mus and observed Peromyscus data. Simulations show good fit to automated size selection (median samples from each sizing strategy and simulation of matched read counts, r² = 0.99 and 0.98 for narrow and wide sizing, respectively), but match less well for gel extraction (median r² = 0.94). (B) Simulated data are concordant in mean sequence coverage across fragments as a function of total read depth per individual in all size-selection schemes (open circles: observed data, dotted line: simulation). (C) The number of regions with coverage ≥7× as a function of total read depth per individual, and (D) mean number of regions with coverage ≥7× shared with other individuals, show very high concordance with normal sampling distributions in both narrow and wide automated size selection but are less well fit by any tested sampling distribution for the gel extraction method.
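To make panel (C) concrete, the number of regions expected to reach ≥7× coverage at a given read depth can be sketched as follows. This is an illustration, not the authors' simulation: it assumes reads are shared among fragments in proportion to a normal size-selection weight and that per-fragment read counts are Poisson distributed. The Poisson model and the function names are assumptions introduced here.

```python
import math


def poisson_tail(min_count, lam):
    """P(X >= min_count) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    below = 0.0
    for k in range(min_count):
        below += term          # accumulate P(X = k)
        term *= lam / (k + 1)  # advance to P(X = k + 1)
    return max(0.0, 1.0 - below)


def expected_regions_covered(fragment_lengths, total_reads,
                             mean=300.0, sd=11.0, min_cov=7):
    """Expected number of fragments sequenced at min_cov or deeper.

    Assumption: reads are divided among fragments in proportion to a
    normal size-selection weight centred on `mean` with spread `sd`.
    """
    weights = [math.exp(-0.5 * ((n - mean) / sd) ** 2) for n in fragment_lengths]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(poisson_tail(min_cov, total_reads * w / total_weight)
               for w in weights)


# Hypothetical fragment lengths spread evenly around a 300 bp target.
lengths = list(range(250, 351))
for reads in (500, 1000, 2000):
    print(reads, round(expected_regions_covered(lengths, reads), 1))
```

Under these assumptions the count of well-covered regions rises with read depth and saturates once fragments near the target size are all above the threshold, while fragments far from the target remain poorly covered.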
Figure 5. Discovery and genotyping of ddRADseq markers in a laboratory cross and wild populations without a reference genome.
ddRADseq was used to identify SNPs between two Peromyscus species, neither of which had a genome sequence available, that were crossed as part of a QTL experiment. This yielded 1158 unique markers that were fixed within, but different between, the parental species. By calculating the fraction of recombinant genotypes and LOD of linkage between markers, we generated (A) 24 groups of strongly linked markers (heatmap colors represent strength of linkage as recombination frequency, upper left, and LOD, lower right, between all pairs of markers) and (B) a genetic map with average inter-marker distance of 1.6 cM. ddRADseq was also used to genotype wild-caught and lab-reared individuals of P. leucopus. Our ddRADseq method permitted successful genotyping of wild-caught individuals even when the allelic variants within a population are unknown. (C) Estimated site frequency spectrum of wild P. leucopus caught from a single Louisiana population. (D) Genetic structure between five populations of P. leucopus. Dots represent individuals (N = 92) and color indicates the state from which each individual was collected: LA = Louisiana; NE = Nebraska; PA = Pennsylvania; MA = Massachusetts; NC = North Carolina.
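The two pairwise statistics named in this caption, recombinant fraction and LOD of linkage, can be sketched for a single marker pair. The sketch assumes a backcross-style design in which each individual carries one of two genotype classes at every marker, so an individual counts as recombinant when its classes differ between the two markers. The caption does not specify the cross design or the software used, and the function names are hypothetical.

```python
import math


def count_recombinants(geno_a, geno_b):
    """Return (recombinants, informative individuals) for two markers.

    Individuals with a missing call (None) at either marker are skipped.
    """
    pairs = [(a, b) for a, b in zip(geno_a, geno_b)
             if a is not None and b is not None]
    n_rec = sum(a != b for a, b in pairs)
    return n_rec, len(pairs)


def lod_score(n_rec, n):
    """log10 likelihood ratio of linkage at r = n_rec / n versus r = 0.5."""
    if n == 0:
        raise ValueError("no informative individuals")
    r = n_rec / n
    if r >= 0.5:
        return 0.0  # estimate capped at 0.5: no evidence of linkage
    log_l = (n - n_rec) * math.log10(1 - r)
    if n_rec > 0:
        log_l += n_rec * math.log10(r)
    # Under the null r = 0.5 every individual contributes log10(0.5).
    return log_l + n * math.log10(2)


# Hypothetical genotype calls for ten individuals at two markers.
marker_1 = ["A", "A", "H", "H", "A", "H", "A", "H", None, "A"]
marker_2 = ["A", "A", "H", "H", "A", "H", "H", "H", "A", "A"]
n_rec, n = count_recombinants(marker_1, marker_2)
print(n_rec / n, lod_score(n_rec, n))
```

Computing both values for every pair of markers gives the two triangles of a heatmap like panel (A); markers can then be grouped by thresholding on them.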


References

    1. Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, et al. The Drosophila melanogaster Genetic Reference Panel. Nature. 2012;482:173–178. doi: 10.1038/nature10811.
    2. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
    3. Altshuler D, Pollara VJ, Cowles CR, van Etten WJ, Baldwin J, et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000;407:513–516. doi: 10.1038/35035083.
    4. van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, et al. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature Methods. 2008;5:247–252. doi: 10.1038/NMETH.1185.
    5. Gompert Z, Forister ML, Fordyce JA, Nice CC, Williamson RJ, et al. Bayesian analysis of molecular variance in pyrosequences quantifies population genetic structure across the genome of Lycaeides butterflies. Molecular Ecology. 2010;19:2455–2473. doi: 10.1111/j.1365-294X.2010.04666.x.
