SMaSH: Sample matching using SNPs in humans

BMC Genomics. 2019 Dec 30;20(Suppl 12):1001. doi: 10.1186/s12864-019-6332-7.


Background: Inadvertent sample swaps are a real threat to data quality in any medium- to large-scale omics study. While matches between samples from the same individual can in principle be identified from a few well-characterized single nucleotide polymorphisms (SNPs), omics data types often provide only low to moderate coverage, so evidence from a large number of SNPs must be integrated to determine whether two samples derive from the same individual.

Methods: We select about six thousand SNPs in the human genome and develop a Bayesian framework that robustly identifies sample matches between next-generation sequencing data sets.
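The abstract does not spell out the Bayesian model, so the following is only a minimal sketch of how evidence from many low-coverage SNPs could be combined, not SMaSH's actual implementation. It assumes Hardy-Weinberg genotype priors, a binomial read model with a fixed sequencing error rate, and independence between SNPs. All function names, the error rate, and the example allele frequencies and read counts are hypothetical.

```python
import math

def hardy_weinberg(p_alt):
    """Prior over genotypes carrying 0, 1, or 2 alternate alleles (assumption: HWE)."""
    q = 1.0 - p_alt
    return [q * q, 2.0 * p_alt * q, p_alt * p_alt]

def genotype_likelihoods(ref, alt, err=0.01):
    """P(read counts | genotype) up to a constant; the binomial coefficient
    is omitted because it cancels in the likelihood ratio below."""
    alt_read_prob = (err, 0.5, 1.0 - err)  # hom-ref, het, hom-alt
    return [(a ** alt) * ((1.0 - a) ** ref) for a in alt_read_prob]

def log_bayes_factor(snps, err=0.01):
    """Sum over SNPs of log P(data | same individual) / P(data | different).

    snps: iterable of (p_alt, (ref1, alt1), (ref2, alt2)) tuples.
    Intended for low to moderate read counts; very deep coverage would
    need a log-space formulation to avoid underflow.
    """
    total = 0.0
    for p_alt, (r1, a1), (r2, a2) in snps:
        prior = hardy_weinberg(p_alt)
        l1 = genotype_likelihoods(r1, a1, err)
        l2 = genotype_likelihoods(r2, a2, err)
        # Same individual: both samples share one genotype.
        same = sum(pg * x * y for pg, x, y in zip(prior, l1, l2))
        # Different individuals: genotypes drawn independently.
        diff = (sum(pg * x for pg, x in zip(prior, l1))
                * sum(pg * y for pg, y in zip(prior, l2)))
        total += math.log(same) - math.log(diff)
    return total

def posterior_same(log_bf, prior_same=0.5):
    """Posterior probability of a match, via a numerically stable sigmoid."""
    log_odds = log_bf + math.log(prior_same / (1.0 - prior_same))
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Hypothetical example: three SNPs with consistent genotypes in both samples.
example = [
    (0.3, (0, 5), (0, 4)),
    (0.5, (3, 3), (2, 2)),
    (0.2, (6, 0), (4, 0)),
]
print(posterior_same(log_bayes_factor(example)))
```

Each SNP contributes only weak evidence at low coverage, which is why summing log Bayes factors over thousands of SNPs matters: SNPs with no reads in either sample contribute zero, while agreeing or conflicting genotypes shift the total toward a match or a mismatch.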

Results: We validate our approach on a variety of data sets. Most importantly, we show that our approach can establish identity between different omics data types such as Exome, RNA-Seq, and MethylCap-Seq. We demonstrate how identity detection degrades with sample quality and read coverage, but show that twenty million reads of a fairly low-quality RNA-Seq sample are still sufficient for reliable sample identification.

Conclusion: Our tool, SMaSH, is able to identify sample mismatches in next-generation sequencing data sets across different sequencing modalities and in low-quality sequencing data.

Keywords: Identity matching; Next generation sequencing data; Sample swap.

MeSH terms

  • Bayes Theorem
  • Genome, Human / genetics
  • Genomics / methods*
  • High-Throughput Nucleotide Sequencing
  • Humans
  • Polymorphism, Single Nucleotide / genetics*
  • Reproducibility of Results
  • Sequence Analysis, DNA
  • Software*