BMC Bioinformatics. 2018 Jun 26;19(1):243.
doi: 10.1186/s12859-018-2227-x.

ToTem: A Tool for Variant Calling Pipeline Optimization

Free PMC article

Nikola Tom et al. BMC Bioinformatics.

Abstract

Background: High-throughput bioinformatics analyses of next generation sequencing (NGS) data often require challenging pipeline optimization. The key problem is choosing appropriate tools and selecting the best parameters for optimal precision and recall.

Results: Here we introduce ToTem, a tool for automated pipeline optimization. ToTem is a stand-alone web application with a comprehensive graphical user interface (GUI). It is written in Java and PHP and uses an underlying MySQL database. Its primary role is to automatically generate, execute and benchmark different variant calling pipeline settings. The tool allows an analysis to be started from any stage of the process and allows almost any tool or code to be plugged in. To prevent over-fitting of pipeline parameters, ToTem ensures their reproducibility by using cross-validation techniques that penalize the final precision, recall and F-measure. The results are presented as interactive graphs and tables, allowing an optimal pipeline to be selected based on the user's priorities. Using ToTem, we were able to optimize somatic variant calling from ultra-deep targeted gene sequencing (TGS) data and germline variant detection in whole genome sequencing (WGS) data.

Conclusions: ToTem is a tool for automated pipeline optimization which is freely available as a web application at https://totem.software.
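The idea of generating many pipeline settings for benchmarking can be illustrated with a minimal Python sketch. It is not ToTem's implementation (ToTem itself is written in Java and PHP); it only shows one common way to enumerate configurations, as the Cartesian product of candidate parameter values. The function name `enumerate_configurations` and the parameter names `min_coverage` and `min_var_freq` are hypothetical and chosen for illustration.

```python
from itertools import product


def enumerate_configurations(parameter_grid):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = sorted(parameter_grid)  # fixed order so the output is deterministic
    value_lists = [parameter_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical filter thresholds; names and values are illustrative only.
grid = {
    "min_coverage": [50, 100],
    "min_var_freq": [0.01, 0.05, 0.10],
}
configurations = enumerate_configurations(grid)
```

Each resulting dictionary would correspond to one pipeline configuration to execute and benchmark. The number of configurations grows multiplicatively with each added parameter, which is why automated execution and comparison become useful.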

Keywords: Benchmarking; Next generation sequencing; Parameter optimization; Variant calling.

Conflict of interest statement

Ethics approval and consent to participate

The study as a whole, and the written informed consent obtained from all patients analysed for variant discovery in TP53, were approved by the Ethical Committee of University Hospital Brno in accordance with the Declaration of Helsinki.

For GIAB data, ethics approval is not required as the human data were publicly available on the GIAB website.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
a Once the pipeline is set up for optimization, all configurations are run in parallel using the raw input data. In this particular example, the emphasis is on optimizing the variant calling filters; however, the pipeline design depends on the user’s needs. In the GIAB approach, the benchmarking step is part of the pipeline and is performed by RTG Tools and hap.py. The pipeline results, in the form of stratified performance reports (csv) provided by hap.py, are imported into ToTem’s internal database and filtered using ToTem’s filtering tool. This allows the best performing pipeline to be selected based on the chosen quality metrics, variant type and genomic region. b As in the previous diagram, the optimization is focused on tuning the variant filtering. In contrast to the previous case, Little Profet requires the pipeline results to be represented as tables of normalized variants with mandatory headers (CHROM, POS, REF, ALT). Such data are imported into ToTem’s internal database for pipeline benchmarking by the Little Profet method. Benchmarking is done by comparing the results of each pipeline to the ground truth reference variant calls in the given regions of interest, estimating TP, FP and FN, and deriving the quality metrics precision, recall and F-measure from them. To prevent overfitting of the pipelines, Little Profet also calculates the reproducibility of each quality metric over different data subsets. The results are provided in the form of interactive graphs and tables.
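The benchmarking step described in panel b can be sketched as follows. This is a minimal illustration, not Little Profet's actual code: it assumes variants are already normalized and restricted to the regions of interest, treats a call as a true positive only when CHROM, POS, REF and ALT all match a ground truth record exactly, and uses the standard definitions of precision, recall and F-measure. The function names `variant_key` and `benchmark` and the example variants are hypothetical.

```python
def variant_key(record):
    """Identify a variant by the four mandatory columns."""
    return (record["CHROM"], record["POS"], record["REF"], record["ALT"])


def benchmark(called, truth):
    """Compare called variants to ground truth and derive quality metrics."""
    called_keys = {variant_key(r) for r in called}
    truth_keys = {variant_key(r) for r in truth}
    tp = len(called_keys & truth_keys)  # called and present in the truth set
    fp = len(called_keys - truth_keys)  # called but absent from the truth set
    fn = len(truth_keys - called_keys)  # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "F": f_measure}


# Made-up example records, for illustration only.
truth = [
    {"CHROM": "chr17", "POS": 100, "REF": "C", "ALT": "T"},
    {"CHROM": "chr17", "POS": 200, "REF": "G", "ALT": "A"},
    {"CHROM": "chr17", "POS": 300, "REF": "A", "ALT": "AT"},
]
called = [
    {"CHROM": "chr17", "POS": 100, "REF": "C", "ALT": "T"},
    {"CHROM": "chr17", "POS": 200, "REF": "G", "ALT": "A"},
    {"CHROM": "chr17", "POS": 400, "REF": "T", "ALT": "C"},
]
result = benchmark(called, truth)
```

In this example two calls match the truth set, one call is spurious and one true variant is missed, so TP = 2, FP = 1 and FN = 1. Exact-match comparison is why the caption requires normalized variants: the same indel written in two different representations would otherwise be counted as one FP and one FN.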
Fig. 2
Each dot represents the arithmetic mean of recall (X-axis) and precision (Y-axis) for one pipeline configuration, calculated from repeated random sub-sampling of 3 input datasets (220 samples). The crosshair lines show the standard deviation of the respective results across the sub-sampled sets. The individual variant callers (Mutect2, VarDict and VarScan2) are colour coded, with the default setting of each distinguished. The default settings and the best performing configurations for each variant caller are also enlarged. In our experiment, the largest variant calling improvement (2.36× higher F-measure compared to default settings, highlighted by an arrow) and also the highest overall recall, precision, precision-recall, and F-measure were registered for VarScan2. For VarDict, a significant improvement in variant detection, mainly in recall (2.42×), was observed. Optimization of Mutect2 greatly increased the precision (1.74×). Although its F-measure after optimization did not reach values as high as those of VarScan2 and VarDict, Mutect2’s default setting provided the best results, mainly in terms of recall.
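How a single dot and its crosshair might be computed can be sketched as below. This is an illustrative assumption, not the authors' procedure: the caption does not state the subset size, the number of rounds, or how counts are pooled, so the sketch draws half of the samples without replacement in each round, pools TP, FP and FN within the subset, and reports the mean and population standard deviation across rounds. The function name `subsample_metrics` and its parameters `n_rounds`, `fraction` and `seed` are hypothetical.

```python
import random
import statistics


def subsample_metrics(per_sample_counts, n_rounds=10, fraction=0.5, seed=0):
    """Mean and standard deviation of precision and recall over random subsets.

    per_sample_counts: list of (tp, fp, fn) tuples, one per sample.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    size = max(1, int(len(per_sample_counts) * fraction))
    precisions, recalls = [], []
    for _ in range(n_rounds):
        subset = rng.sample(per_sample_counts, size)  # without replacement
        tp = sum(c[0] for c in subset)
        fp = sum(c[1] for c in subset)
        fn = sum(c[2] for c in subset)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return {
        "precision_mean": statistics.mean(precisions),
        "precision_sd": statistics.pstdev(precisions),
        "recall_mean": statistics.mean(recalls),
        "recall_sd": statistics.pstdev(recalls),
    }


# Made-up counts: eight samples with identical performance.
uniform = [(8, 2, 2)] * 8
summary = subsample_metrics(uniform)
```

With identical samples every subset yields the same precision and recall, so both standard deviations are zero. A configuration whose metrics vary strongly between subsets would show long crosshair lines, which is the signal of poor reproducibility that the figure is designed to expose.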
