Bioinformatics. 2021 Apr 5;36(24):5582-5589.
doi: 10.1093/bioinformatics/btaa1081.

Accurate, scalable cohort variant calls using DeepVariant and GLnexus

Taedong Yun et al. Bioinformatics. 2021.

Abstract

Motivation: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging.

Results: We introduce an open-source cohort-calling method that uses the highly accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimize the method across a range of cohort sizes, sequencing methods and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently generated GATK Best Practices pipeline.

Availability and implementation: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-source, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Genotype quality (GQ) distribution properties of PASS variants. (A) Genotype quality calibration for DeepVariant v0.8.0. Reported GQ is plotted against the empirical GQ calculated using genome-wide GIAB benchmark variant calls at 40× coverage. Each data point is a set of variant calls with the same GQ (x-axis), and the y-axis value is the empirical error rate calculated from the GIAB truth set. Both axes are in Phred-scale. Marker areas are proportional to the square root of the number of variants. The dotted y=x line represents perfect calibration. (B) Genotype quality calibration for GATK4 HaplotypeCaller, analogous to (A). (C) Distributions of reported GQ for DeepVariant v0.8.0 in all 1248 samples computed genome-wide. (D) Distributions of reported GQ for GATK4 HaplotypeCaller in all 1248 samples computed on chromosome 2 only. Note the broken y-axis and different scales. See also Supplementary Figure S4
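The calibration in panels (A) and (B) compares a reported Phred-scale GQ with the error rate observed among calls that share that GQ. A minimal Python sketch of that comparison follows. The function names and the `(reported_gq, is_correct)` input layout are illustrative assumptions, not the authors' code.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to the Phred scale: -10 * log10(p)."""
    return -10.0 * math.log10(p_error)


def empirical_gq_by_reported(calls):
    """Group calls by reported GQ and compute the empirical Phred-scale GQ.

    `calls` is an iterable of (reported_gq, is_correct) pairs (assumed layout),
    where is_correct says whether the call matched the truth set.
    Bins with zero observed errors are skipped because their empirical
    Phred value is undefined (infinite).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for gq, is_correct in calls:
        totals[gq] += 1
        if not is_correct:
            errors[gq] += 1
    return {
        gq: phred(errors[gq] / totals[gq])
        for gq in totals
        if errors[gq] > 0
    }


# Hypothetical data: 100 calls reported at GQ 20, one of them wrong.
example_calls = [(20, True)] * 99 + [(20, False)]
calibration = empirical_gq_by_reported(example_calls)
```

A well-calibrated caller would place each `(reported, empirical)` pair close to the y=x line; in the hypothetical example the observed 1% error rate corresponds to an empirical GQ of about 20.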
Fig. 2.
Parameter search for n = 1247, 15× cohort. Each data point represents a unique parameter combination explored by Vizier. The color indicates whether the GLnexus parameter to revise genotypes was true (orange) or false (blue), and the shape represents the search algorithm. The x-axis indicates Mendelian violation rate. The y-axis indicates errors on GIAB through the harmonic mean of SNP F1 and indel F1 (lower is more accurate). Points toward the lower left are more accurate on both metrics. The intersection of the green horizontal and vertical dotted lines indicates the performance using GLnexus with no variant modification (Supplementary Table S4). Supplementary Figure S5 shows the results for all cohort sizes and coverages. The red diamond indicates the parameter set we selected for the optimized DeepVariant+GLnexus pipeline
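The y-axis folds SNP and indel accuracy into one number. The sketch below shows one plausible way to compute such a value, assuming the plotted error is 1 minus the harmonic mean of the two F1 scores; the caption does not spell out the exact formula, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def combined_error(snp_pr, indel_pr):
    """Assumed error metric: 1 - harmonic mean of SNP F1 and indel F1.

    Each argument is a (precision, recall) pair.
    """
    snp_f1 = f1_score(*snp_pr)
    indel_f1 = f1_score(*indel_pr)
    return 1.0 - harmonic_mean(snp_f1, indel_f1)
```

Because the harmonic mean is pulled toward the smaller of its inputs, a parameter set cannot score well by excelling on SNPs while performing poorly on indels.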
Fig. 3.
Comparison of four cohort callset creation methods. Four calling and merging pipelines are applied at both 15× and 40× sequence coverage for WGS cohorts of size n = 3, 100, 333 and 1247. Five evaluation metrics are presented: Mendelian Violation Rate, SNP False Discovery Rate (1-Precision), SNP False Negative Rate (1-Recall), indel False Discovery Rate and indel False Negative Rate. In all cases, lower values are better. All evaluation metrics are computed on chr20. See Supplementary Table S5 for the precise values and the variances of each metric
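The two error rates named in the caption follow directly from true-positive, false-positive and false-negative counts. A minimal sketch, with illustrative function names and hypothetical counts:

```python
def false_discovery_rate(tp, fp):
    """FDR = 1 - precision = FP / (TP + FP)."""
    called = tp + fp
    return fp / called if called else 0.0


def false_negative_rate(tp, fn):
    """FNR = 1 - recall = FN / (TP + FN)."""
    in_truth = tp + fn
    return fn / in_truth if in_truth else 0.0


# Hypothetical counts: 90 correct calls, 10 spurious calls, 30 missed variants.
fdr = false_discovery_rate(tp=90, fp=10)
fnr = false_negative_rate(tp=90, fn=30)
```

Reporting the complements of precision and recall keeps every panel on a "lower is better" scale, matching the Mendelian violation rate.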
Fig. 4.
1KGP cohort callset quality. (A) Ti:Tv ratios of 1KGP samples, from single-sample SNPs and joint-called SNPs, generated by the DV-GLN-OPT and GATK pipelines. Each point represents the ratio in one of the 2504 samples across the whole genome. Each point cloud compares the Ti:Tv ratios in variant calls from the two systems, after equivalent steps are performed. The first cloud (in light green) compares the Ti:Tv ratios from DeepVariant (y-axis) and GATK HaplotypeCaller (x-axis) single-sample calls. The second cloud (in turquoise) compares Ti:Tv after joint genotyping is performed (optimized GLnexus for DeepVariant, and GenomicsDBImport+GenotypeGVCFs for GATK HaplotypeCaller). Finally, the third cloud (in blue) compares the final outputs from the two systems, after VQSR is performed for GATK (x-axis), while no additional operation is performed for the optimized DeepVariant-GLnexus calls. (B) Fractional counts of autosomal variants with low HWE p-values, binned by non-major allele frequency in DV-GLN-OPT, GATK-VQSR and GATK-Joint. The major allele is the allele with the largest allele count in a given variant within the callset. The variants are aggregated in non-major-allele-frequency bins of size 0.0125, and the frequency is clipped at 0.5 for visualization purposes (for all methods the fractional counts in bins after 0.5 are less than 10^-3)
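Two quantities in this caption have simple definitions: the Ti:Tv ratio (transitions A<->G and C<->T over all other single-base substitutions) and the non-major allele frequency (1 minus the frequency of the most common allele). A minimal sketch, assuming SNPs arrive as `(ref, alt)` pairs of single bases and allele counts as a plain list; both layouts and the function names are illustrative.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions for a list of (ref, alt) SNPs."""
    ti = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


def non_major_allele_frequency(allele_counts):
    """1 - (largest allele count / total allele count) at one variant site.

    `allele_counts` lists the count of every allele, reference included.
    """
    total = sum(allele_counts)
    return 1.0 - max(allele_counts) / total


# Hypothetical inputs.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
freq = non_major_allele_frequency([50, 30, 20])
```

Using the non-major rather than the alternate allele frequency keeps multi-allelic sites and sites where the reference allele is rare on a comparable axis.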
Fig. 5.
Mendelian violations in autosomes of a cryptic trio in 1KGP. (A) The percentage of variants that violate Mendelian inheritance in the trio NA20900-NA20891-NA20882 as a function of the number of variants considered. Variants are ranked by the minimum GQ within the trio. Callset variants with homozygous reference calls for all three trio samples, and those that have indeterminate violation status due to missing genotype calls in the trio, are ignored. (B) Mendelian violation percentages of the same trio binned by minimum GQ in the trio using bin size 5
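The curve in panel (A) can be sketched as a running violation percentage over sites sorted by minimum trio GQ. The version below is a minimal illustration for diploid autosomal genotypes; the tuple layout, the function names, and the choice to rank from highest to lowest GQ are assumptions, not details taken from the paper.

```python
HOM_REF = (0, 0)


def is_mendelian_violation(child, father, mother):
    """True if the child's two alleles cannot be one from each parent."""
    a, b = child
    consistent = (a in father and b in mother) or (b in father and a in mother)
    return not consistent


def cumulative_violation_pct(sites):
    """Running violation percentage over sites ranked by minimum trio GQ.

    `sites` is a list of (child, father, mother, min_gq) tuples, where each
    genotype is a pair of allele indices or None when the call is missing.
    Sites with any missing genotype, or hom-ref in all three samples, are
    ignored, mirroring the filtering described in the caption.
    """
    usable = [
        s for s in sites
        if None not in s[:3] and not all(g == HOM_REF for g in s[:3])
    ]
    usable.sort(key=lambda s: s[3], reverse=True)  # assumed: highest GQ first
    percentages = []
    violations = 0
    for n, (child, father, mother, _) in enumerate(usable, start=1):
        violations += is_mendelian_violation(child, father, mother)
        percentages.append(100.0 * violations / n)
    return percentages


# Hypothetical sites: one consistent, one violation, one all hom-ref, one missing.
example_sites = [
    ((0, 1), (0, 0), (0, 1), 50),
    ((1, 1), (0, 0), (0, 1), 30),
    ((0, 0), (0, 0), (0, 0), 99),
    (None, (0, 1), (0, 1), 40),
]
curve = cumulative_violation_pct(example_sites)
```

If GQ is informative, the running percentage should stay low for the best-ranked sites and rise as lower-confidence sites are added.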
Fig. 6.
Imputation accuracy of 1KGP reference panel. (A) Variant sites covered by the DV-GLN-OPT and GATK panels. The DV-GLN-OPT reference panel generated from 1KGP samples covers 43 181 562 variant sites, while the GATK panel from the same samples covers 41 247 330 sites. The intersection of the two panel regions (marked in light blue) covers 40 972 007 sites, which is 94.88% of the DV-GLN-OPT panel and 99.33% of the GATK panel. (B) Imputed genotype accuracy for indels. The accuracy of the imputed variants is measured by computing concordance with the GIAB benchmark calls using hap.py. Blue markers are from the DV-GLN-OPT panel, while red markers are from the GATK panel. The shaped markers show precision and recall computed across the GIAB evaluation region for two samples. (C) Imputed genotype accuracy for SNPs. Shapes and colors as in (B)
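The overlap percentages in panel (A) can be checked from the three site counts given in the caption. The variable and function names below are illustrative.

```python
# Site counts quoted in the caption.
dv_sites = 43_181_562      # DV-GLN-OPT panel
gatk_sites = 41_247_330    # GATK panel
shared_sites = 40_972_007  # intersection of the two panels


def pct(part, whole):
    """Percentage of `whole` covered by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


shared_of_dv = pct(shared_sites, dv_sites)      # 94.88
shared_of_gatk = pct(shared_sites, gatk_sites)  # 99.33

# Sites found in only one of the two panels.
dv_only = dv_sites - shared_sites      # 2_209_555
gatk_only = gatk_sites - shared_sites  # 275_323
```

The asymmetry follows from the counts: the DV-GLN-OPT panel contains about 2.2 million sites that the GATK panel lacks, against roughly 0.28 million in the other direction.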
Fig. 7.
Cost benchmarking of the DeepVariant-GLnexus and GATK pipelines. (A) Distribution of elapsed real times to generate a single-sample gVCF (chr22 only) from aligned reads across n = 2504 1KGP samples, using DeepVariant and GATK HaplotypeCaller (BQSR not included) on an 8-vCPU machine. GPU/TPU acceleration was not used for DeepVariant. (B) Elapsed real times to generate the gVCF (chr22 only) of one sample (NA12878) using a cloud machine with a varying number of vCPUs, with DeepVariant and GATK HaplotypeCaller (excluding BQSR). The default value for HaplotypeCaller’s HMM multithreading flag (--native-pair-hmm-threads) is 4 (red arrow), and it was practically ineffectual for 16 vCPUs and more (red dotted lines). (C) Elapsed real times to merge the chr22 gVCF files from (A) into a cohort VCF for n ∈ {10, 100, 1000, 2504} nested subsets of the 1KGP samples, using GLnexus (for DeepVariant gVCFs) and GATK GenomicsDBImport + GenotypeGVCFs (for HaplotypeCaller gVCFs). The GATK VQSR step was not included. (D) The file sizes of the whole-genome cohort VCFs and the single-sample gVCFs of 1KGP samples from the DeepVariant-GLnexus and GATK pipelines

Cited by

  • Bioinformatics of germline variant discovery for rare disease diagnostics: current approaches and remaining challenges.
    Barbitoff YA, Ushakov MO, Lazareva TE, Nasykhova YA, Glotov AS, Predeus AV. Brief Bioinform. 2024 Jan 22;25(2):bbad508. doi: 10.1093/bib/bbad508. PMID: 38271481. Free PMC article. Review.
  • Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases.
    Pagnamenta AT, Camps C, Giacopuzzi E, Taylor JM, Hashim M, Calpena E, Kaisaki PJ, Hashimoto A, Yu J, Sanders E, Schwessinger R, Hughes JR, Lunter G, Dreau H, Ferla M, Lange L, Kesim Y, Ragoussis V, Vavoulis DV, Allroggen H, Ansorge O, Babbs C, Banka S, Baños-Piñero B, Beeson D, Ben-Ami T, Bennett DL, Bento C, Blair E, Brasch-Andersen C, Bull KR, Cario H, Cilliers D, Conti V, Davies EG, Dhalla F, Dacal BD, Dong Y, Dunford JE, Guerrini R, Harris AL, Hartley J, Hollander G, Javaid K, Kane M, Kelly D, Kelly D, Knight SJL, Kreins AY, Kvikstad EM, Langman CB, Lester T, Lines KE, Lord SR, Lu X, Mansour S, Manzur A, Maroofian R, Marsden B, Mason J, McGowan SJ, Mei D, Mlcochova H, Murakami Y, Németh AH, Okoli S, Ormondroyd E, Ousager LB, Palace J, Patel SY, Pentony MM, Pugh C, Rad A, Ramesh A, Riva SG, Roberts I, Roy N, Salminen O, Schilling KD, Scott C, Sen A, Smith C, Stevenson M, Thakker RV, Twigg SRF, Uhlig HH, van Wijk R, Vona B, Wall S, Wang J, Watkins H, Zak J, Schuh AH, Kini U, Wilkie AOM, Popitsch N, Taylor JC. Genome Med. 2023 Nov 9;15(1):94. doi: 10.1186/s13073-023-01240-0. PMID: 37946251. Free PMC article.
  • Exploiting public databases of genomic variation to quantify evolutionary constraint on the branch point sequence in 30 plant and animal species.
    Nosková A, Li C, Wang X, Leonard AS, Pausch H, Kadri NK. Nucleic Acids Res. 2023 Dec 11;51(22):12069-12075. doi: 10.1093/nar/gkad970. PMID: 37953306. Free PMC article.
  • Multiancestry exome sequencing reveals INHBE mutations associated with favorable fat distribution and protection from diabetes.
    Akbari P, Sosina OA, Bovijn J, Landheer K, Nielsen JB, Kim M, Aykul S, De T, Haas ME, Hindy G, Lin N, Dinsmore IR, Luo JZ, Hectors S, Geraghty B, Germino M, Panagis L, Parasoglou P, Walls JR, Halasz G, Atwal GS; Regeneron Genetics Center; DiscovEHR Collaboration; Jones M, LeBlanc MG, Still CD, Carey DJ, Giontella A, Orho-Melander M, Berumen J, Kuri-Morales P, Alegre-Díaz J, Torres JM, Emberson JR, Collins R, Rader DJ, Zambrowicz B, Murphy AJ, Balasubramanian S, Overton JD, Reid JG, Shuldiner AR, Cantor M, Abecasis GR, Ferreira MAR, Sleeman MW, Gusarova V, Altarejos J, Harris C, Economides AN, Idone V, Karalis K, Della Gatta G, Mirshahi T, Yancopoulos GD, Melander O, Marchini J, Tapia-Conyer R, Locke AE, Baras A, Verweij N, Lotta LA. Nat Commun. 2022 Aug 23;13(1):4844. doi: 10.1038/s41467-022-32398-7. PMID: 35999217. Free PMC article.
  • From beasts to bytes: Revolutionizing zoological research with artificial intelligence.
    Zhang YJ, Luo Z, Sun Y, Liu J, Chen Z. Zool Res. 2023 Nov 18;44(6):1115-1131. doi: 10.24272/j.issn.2095-8137.2023.263. PMID: 37933101. Free PMC article. Review.
