On genomic repeats and reproducibility

Bioinformatics. 2016 Aug 1;32(15):2243-7. doi: 10.1093/bioinformatics/btw139. Epub 2016 Mar 11.

Abstract

Results: Here, we present a comprehensive analysis of the reproducibility of computational characterization of genomic variants using high-throughput sequencing data. We reanalyzed the same datasets twice, using the same tools with the same parameters, altering only the order of reads in the input (i.e. the FASTQ file). Reshuffling caused reads from repetitive regions to be mapped to different locations in the second alignment, and we observed similar results when we applied only a scatter/gather approach for read mapping, without prior shuffling. Our results show that some of the most common variation discovery algorithms do not handle ambiguous read mappings accurately when such reads are assigned to randomly selected locations. In addition, we observed that even when the exact same alignment is used, the GATK HaplotypeCaller generates slightly different call sets, which we trace to the variant filtration step. We conclude that algorithms at each step of genomic variation discovery and characterization need to treat ambiguous mappings in a deterministic fashion to ensure full replication of results.
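The order dependence described above can be illustrated with a toy sketch. It is not the authors' code and does not model any particular aligner. The read names, candidate positions and both placement functions are hypothetical. The sketch assumes that every read has several equally good candidate locations, and it contrasts a random choice drawn from one shared generator, which depends on read order, with a choice derived only from the read name, which does not.

```python
import hashlib
import random

# Hypothetical toy data: each read name maps to several equally good
# candidate locations (chromosome, position), as for a read from a repeat.
CANDIDATES = {
    f"read{i}": [("chr1", 100 + i), ("chr1", 5000 + i), ("chr2", 300 + i)]
    for i in range(50)
}


def place_random(read_names, seed=0):
    """Pick a location per read from one shared random stream.

    The seed is fixed, but the draw each read receives depends on its
    position in the input, so reordering the reads can change placements.
    """
    rng = random.Random(seed)
    return {name: rng.choice(CANDIDATES[name]) for name in read_names}


def place_deterministic(read_names):
    """Pick a location per read using only the read's own name.

    Candidates are sorted and indexed by a hash of the read name, so the
    result is independent of input order and of how the input is split.
    """
    placements = {}
    for name in read_names:
        options = sorted(CANDIDATES[name])
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(options)
        placements[name] = options[index]
    return placements


def count_differences(first, second):
    """Number of reads placed at different locations in two runs."""
    return sum(1 for name in first if first[name] != second[name])


original_order = sorted(CANDIDATES)
reordered = list(reversed(original_order))

random_diff = count_differences(
    place_random(original_order), place_random(reordered)
)
deterministic_diff = count_differences(
    place_deterministic(original_order), place_deterministic(reordered)
)

print("reads placed differently, shared random stream:", random_diff)
print("reads placed differently, name-derived choice:", deterministic_diff)
```

In this sketch the name-derived choice gives identical placements for both input orders by construction, while the shared random stream will typically place some reads differently once the order changes.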

Availability and implementation: Code, scripts and the generated VCF files are available at DOI:10.5281/zenodo.32611.

Contact: calkan@cs.bilkent.edu.tr

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Genome
  • Genomics*
  • High-Throughput Nucleotide Sequencing*
  • Reproducibility of Results
  • Sequence Analysis, DNA