PLoS One, 9 (2), e90346 (eCollection)

TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline

Jeffrey C Glaubitz et al. PLoS One.

Abstract

Genotyping by sequencing (GBS) is a next generation sequencing based method that takes advantage of reduced representation to enable high throughput genotyping of large numbers of individuals at a large number of SNP markers. The relatively straightforward, robust, and cost-effective GBS protocol is currently being applied in numerous species by a large number of researchers. Herein we describe a bioinformatics pipeline, TASSEL-GBS, designed for the efficient processing of raw GBS sequence data into SNP genotypes. The TASSEL-GBS pipeline successfully fulfills the following key design criteria: (1) Ability to run on the modest computing resources that are typically available to small breeding or ecological research programs, including desktop or laptop machines with only 8-16 GB of RAM, (2) Scalability from small to extremely large studies, where hundreds of thousands or even millions of SNPs can be scored in up to 100,000 individuals (e.g., for large breeding programs or genetic surveys), and (3) Applicability in an accelerated breeding context, requiring rapid turnover from tissue collection to genotypes. Although a reference genome is required, the pipeline can also be run with an unfinished "pseudo-reference" consisting of numerous contigs. We describe the TASSEL-GBS pipeline in detail and benchmark it based upon a large scale, species wide analysis in maize (Zea mays), where the average error rate was reduced to 0.0042 through application of population genetic-based SNP filters. Overall, the GBS assay and the TASSEL-GBS pipeline provide robust tools for studying genomic diversity.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. Schematic representation of the TASSEL-GBS Discovery Pipeline.
(A) Barcoded sequence reads are processed and collapsed into a set of unique sequence tags, with one TagCounts file produced per input FASTQ file. The separate TagCounts files are then merged to form a “master” TagCounts file, which retains only those tags present at or above an experiment-wide minimum count. This master tag list is then aligned to the reference genome and a TagsOnPhysicalMap (TOPM) file is generated, containing the genomic position of each tag with a unique, best alignment. (B) The barcode information in the original FASTQ files is then used to tally the number of times each tag in the master tag list is observed in each sample (“taxon”) and these counts are stored in a TagsByTaxa (TBT) file. (C) The information recorded in the TOPM and TBT is then used to discover SNPs at each “TagLocus” (set of tags with the same genomic position) and filter the SNPs based upon the proportion of taxa covered by the TagLocus, minor allele frequency, and inbreeding coefficient (FIT). For each retained SNP, the allele represented by each tag in the corresponding TagLocus is recorded in the TOPM file, along with its relative position in the locus. The end product of the Discovery Pipeline is a “production-ready” TOPM that can then be used by the Production Pipeline to call SNPs.
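The tag-collapsing and merging step in panel (A) can be illustrated with a minimal Python sketch. This is not the TASSEL-GBS implementation: the function names, the `tag_length` default of 64, and the toy reads are assumptions made for illustration, and barcode handling is omitted.

```python
from collections import Counter

def count_tags(reads, tag_length=64):
    """Collapse the reads from one FASTQ file into unique tags with counts.

    tag_length is an assumed truncation length, not taken from the text.
    """
    return Counter(read[:tag_length] for read in reads)

def merge_tag_counts(per_file_counts, min_count):
    """Sum per-file tag counts and keep only tags whose experiment-wide
    count is at or above min_count (the "master" tag list)."""
    merged = Counter()
    for counts in per_file_counts:
        merged.update(counts)
    return {tag: n for tag, n in merged.items() if n >= min_count}

# Toy reads standing in for two input FASTQ files (hypothetical data).
file_a = count_tags(["ACGT", "ACGT", "TTGA"])
file_b = count_tags(["ACGT", "GGCC"])
master = merge_tag_counts([file_a, file_b], min_count=2)
```

Here only the tag observed across both files often enough survives the experiment-wide threshold; singletons, which are more likely to be sequencing errors, are dropped before alignment.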
Figure 2
Figure 2. Relationship between the TASSEL-GBS Discovery and Production pipelines.
The Discovery Pipeline is run periodically on all FASTQ files generated to date in a species, and the ascertained and filtered SNPs are stored in a “production-ready” TOPM. The Production pipeline utilizes this production-ready TOPM to quickly call SNPs either for the original samples in the Discovery Build, or for subsequent, post-Discovery samples.
Figure 3
Figure 3. Within NAM family allele frequency distributions of chromosome 10 SNPs after different levels of filtering.
Allele frequencies were calculated in each of the 25 Nested Association Mapping (NAM) families (collectively comprising 5,254 RILs) after application of the filters to the entire set of 31,978 maize samples in the AllZeaGBSv2.6 build. Allele frequencies were only estimated in a NAM family if at least 19 RILs had non-missing genotypes. Each histogram shows the allele frequency distribution for all the SNP-NAM family combinations with n ≥ 19. (A, B) No filter other than minimum MAF of 0.001. (C, D) A minimal filter only for MAF ≥ 0.01. (E, F) “Standard” maize build filters of MAF ≥ 0.001, minimum FIT in inbred samples of 0.8, inbred coverage > 0.15, and inbred heterozygosity score < 0.21. (A, C, E) All SNP-family combinations: the error-free, monomorphic SNP-family combinations dwarf the segregating SNPs in all three cases. (B, D, F) Polymorphic SNP-family combinations only: omitting the monomorphic SNP-family combinations permits visualization of the remaining allele frequency distribution.
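The per-family estimate described in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the genotype encoding (0, 1, or 2 copies of a designated allele, `None` for missing) and the function name are assumptions; only the threshold of 19 non-missing RILs comes from the text.

```python
def family_allele_frequency(genotypes, min_called=19):
    """Estimate the frequency of a designated allele within one family.

    genotypes: one entry per RIL, giving the number of copies of the
    allele (0, 1, or 2), or None when the genotype is missing.
    Returns None when fewer than min_called RILs have a genotype.
    """
    called = [g for g in genotypes if g is not None]
    if len(called) < min_called:
        return None
    return sum(called) / (2 * len(called))

# Hypothetical family: 10 RILs homozygous for each allele, 5 missing.
segregating = [2] * 10 + [0] * 10 + [None] * 5
# Hypothetical family with too few called genotypes to be counted.
sparse = [2] * 9 + [0] * 9 + [None] * 7

freq = family_allele_frequency(segregating)
skipped = family_allele_frequency(sparse)
```

A SNP-family combination that returns `None` would simply be left out of the histograms.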
Figure 4
Figure 4. Error rate distribution of chromosome 10 SNPs for different levels of filtering.
Error rates in the AllZeaGBSv2.6 Discovery build SNP calls were estimated using the NAM biparental families. NAM family-specific minor allele calls were defined as errors if the family-specific MAF was greater than zero but less than 0.25, and the SNP significantly deviated from 1∶1 segregation in that family at p<0.001.
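The error criterion above can be written as a short Python sketch. The caption does not say which test was used to assess deviation from 1:1 segregation, so the chi-square goodness-of-fit test (1 degree of freedom) below is an assumption, as are the function names and the use of per-family allele counts as input.

```python
import math

def segregation_p_value(n_major, n_minor):
    """P-value for deviation from 1:1 segregation, using a chi-square
    goodness-of-fit test with 1 degree of freedom (assumed test)."""
    expected = (n_major + n_minor) / 2
    chi2 = ((n_major - expected) ** 2 + (n_minor - expected) ** 2) / expected
    # For 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def is_error_call(n_major, n_minor, maf_ceiling=0.25, alpha=0.001):
    """Flag family-specific minor allele calls as errors when
    0 < MAF < maf_ceiling and segregation deviates from 1:1 at p < alpha."""
    total = n_major + n_minor
    if total == 0:
        return False
    maf = min(n_major, n_minor) / total
    if not 0 < maf < maf_ceiling:
        return False
    return segregation_p_value(n_major, n_minor) < alpha

# Hypothetical allele counts within single families.
rare_minor = is_error_call(98, 2)    # low MAF, strong distortion
balanced = is_error_call(52, 48)     # MAF above the ceiling
monomorphic = is_error_call(100, 0)  # MAF of zero is not an error
```

Under this reading, a monomorphic family is never counted as an error, and a family with a minor allele frequency near 0.5 is treated as genuinely segregating.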

Grant support

This work was supported by the National Science Foundation (www.nsf.gov) under the Plant Genome Research Program (PGRP) (grant numbers DBI-0820619 and IOS-1238014) and the Basic Research to Enable Agricultural Development (BREAD) project (ID:IOS-0965342), as well as by the USDA-ARS (www.usda.gov). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
