Genome Med. 2020 Jul 14;12(1):62.
doi: 10.1186/s13073-020-00761-2.

Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches

Brent S Pedersen et al. Genome Med.

Abstract

Background: When interpreting sequencing data from multiple spatial or longitudinal biopsies, detecting sample mix-ups is essential, yet more difficult than in studies of germline variation. In most genomic studies of tumors, genetic variation is detected through pairwise comparisons of the tumor and a matched normal tissue from the sample donor. In many cases, only somatic variants are reported, which hinders the use of existing tools that detect sample swaps solely based on genotypes of inherited variants. To address this problem, we have developed Somalier, a tool that operates directly on alignments and does not require jointly called germline variants. Instead, Somalier extracts a small sketch of informative genetic variation for each sample. Sketches from hundreds of germline or somatic samples can then be compared in under a second, making Somalier a useful tool for measuring relatedness in large cohorts. Somalier produces both text output and an interactive visual report that facilitates the detection and correction of sample swaps using multiple relatedness metrics.

Results: We introduce the tool and demonstrate its utility on a cohort of five glioma samples each with a normal, tumor, and cell-free DNA sample. Applying Somalier to high-coverage sequence data from the 1000 Genomes Project also identifies several related samples. We also demonstrate that it can distinguish pairs of whole-genome and RNA-seq samples from the same individuals in the Genotype-Tissue Expression (GTEx) project.

Conclusions: Somalier is a tool that can rapidly evaluate relatedness from sequencing data. It can be applied to diverse sequencing data types and genome builds and is available under an MIT license at github.com/brentp/somalier.

Conflict of interest statement

Brent S. Pedersen and Aaron R. Quinlan are co-founders of Base2 Genomics. The remaining authors declare that they have no competing interests.

Figures

Fig. 1
Comparing genotype sketches to compute relatedness measures for pairs of samples. a Observed counts for the reference (Ref.) and alternate (Alt.) allele at each of the tested 17,766 loci are converted into genotypes (see main text for details) to create a “sketch” for each sample. b The genotypes for each sample are then converted into three bit vectors: one for homozygous reference (HOMREF) genotypes, one for heterozygous (HET) genotypes, and one for homozygous alternate (HOMALT) genotypes. The length of each vector, counted in 64-bit words, is the total number of autosomal variants in the sketch (i.e., 17,384) divided by 64, and each bit is set to 1 if the sample has the particular genotype at the given variant site. For example, four variant sites are shown in b: the hypothetical individual has a homozygous alternate genotype for the second variant (the corresponding bit is set to 1), but is not homozygous for the alternate allele at the other three variant sites (the corresponding bits are set to 0). c The bit vectors for a pair of samples can be compared to calculate relatedness measures such as identity-by-state zero (IBS0, where zero alleles are shared between two samples) through efficient bitwise operations on the bit vectors for the relevant genotypes.
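The bit-vector comparison described in the Fig. 1 caption can be illustrated with a minimal Python sketch. This is not Somalier's implementation: the function names (`pack_bits`, `sketch`, `ibs0`) and the 0/1/2 genotype encoding are assumptions made for illustration, and the step that turns allele counts into genotypes is left out because the caption defers it to the main text.

```python
from typing import Dict, List, Optional

WORD_BITS = 64  # one bit per variant site, packed into 64-bit words


def pack_bits(flags: List[bool]) -> List[int]:
    """Pack one boolean per variant site into a list of 64-bit words."""
    n_words = (len(flags) + WORD_BITS - 1) // WORD_BITS  # round up
    words = [0] * n_words
    for i, flag in enumerate(flags):
        if flag:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


def sketch(genotypes: List[Optional[int]]) -> Dict[str, List[int]]:
    """Build the three bit vectors from per-site genotypes.

    Hypothetical encoding: 0 = HOMREF, 1 = HET, 2 = HOMALT, None = unknown.
    """
    return {
        "homref": pack_bits([g == 0 for g in genotypes]),
        "het": pack_bits([g == 1 for g in genotypes]),
        "homalt": pack_bits([g == 2 for g in genotypes]),
    }


def popcount(word: int) -> int:
    """Count the bits set to 1 in a non-negative integer."""
    return bin(word).count("1")


def ibs0(a: Dict[str, List[int]], b: Dict[str, List[int]]) -> int:
    """Count sites where one sample is HOMREF and the other is HOMALT."""
    total = 0
    for a_ref, a_alt, b_ref, b_alt in zip(
        a["homref"], a["homalt"], b["homref"], b["homalt"]
    ):
        total += popcount((a_ref & b_alt) | (a_alt & b_ref))
    return total


# Four variant sites, as in panel b: sample A is HOMALT only at the second site.
sample_a = sketch([0, 2, 1, 0])
sample_b = sketch([2, 2, 1, 0])
print(ibs0(sample_a, sample_b))  # prints 1 (only the first site is opposite-homozygous)
```

With this layout, 17,384 autosomal sites fit in 272 words per vector, so comparing a pair of samples costs a few hundred AND, OR, and popcount operations.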
Fig. 2
Glioma samples before and after correction. a A comparison of the IBS0 (number of sites where one sample is homozygous reference and the other is homozygous alternate) and IBS2 (count of sites where samples have the same genotype) metrics for 15 samples. Each point is a pair of samples. Points are positioned by the values calculated from the alignment files (observed relatedness) and colored by whether they are expected to be identical (expected relatedness), as indicated from the command line. In this case, sample swaps are visible as orange points that cluster with green points, and vice versa. The user is able to hover on each point to see the sample pair involved and to change the X and Y axes to any of the metrics calculated by Somalier. b An updated version of the plot in a, produced by re-running Somalier after the sample identities were corrected in the manifest using the information provided by a.
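The comparison of observed and expected relatedness in Fig. 2 can be sketched as follows. This is an illustrative example, not Somalier's method: the `find_discordant_pairs` helper, the `min_concordance` cutoff of 0.9, the sample names, and the toy genotypes are all assumptions. It treats a pair as "observed identical" when IBS2 covers most shared sites, and flags pairs where that disagrees with the manifest.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Genotypes = List[Optional[int]]  # hypothetical encoding: 0 HOMREF, 1 HET, 2 HOMALT


def ibs0(g1: Genotypes, g2: Genotypes) -> int:
    """Sites where one sample is HOMREF and the other is HOMALT."""
    return sum(1 for a, b in zip(g1, g2) if {a, b} == {0, 2})


def ibs2(g1: Genotypes, g2: Genotypes) -> int:
    """Sites where both samples have the same known genotype."""
    return sum(1 for a, b in zip(g1, g2) if a is not None and a == b)


def find_discordant_pairs(
    genotypes: Dict[str, Genotypes],
    donor_of: Dict[str, str],
    min_concordance: float = 0.9,  # assumed cutoff, not taken from the paper
) -> List[Tuple[str, str]]:
    """Return sample pairs whose observed identity contradicts the manifest."""
    flagged = []
    for s1, s2 in combinations(sorted(genotypes), 2):
        g1, g2 = genotypes[s1], genotypes[s2]
        n_sites = sum(1 for a, b in zip(g1, g2) if a is not None and b is not None)
        if n_sites == 0:
            continue  # nothing to compare
        observed_same = ibs2(g1, g2) / n_sites >= min_concordance
        expected_same = donor_of[s1] == donor_of[s2]
        if observed_same != expected_same:
            flagged.append((s1, s2))
    return flagged


# Toy manifest in which the two tumor samples have been swapped.
genotypes = {
    "P1_normal": [0, 1, 2, 0, 1, 2],
    "P1_tumor": [2, 0, 0, 1, 2, 0],
    "P2_normal": [2, 0, 0, 1, 2, 0],
    "P2_tumor": [0, 1, 2, 0, 1, 2],
}
donor_of = {
    "P1_normal": "P1", "P1_tumor": "P1",
    "P2_normal": "P2", "P2_tumor": "P2",
}
flagged = find_discordant_pairs(genotypes, donor_of)
for pair in flagged:
    print(pair)
```

In this toy input, each tumor matches the other patient's normal sample, so both the same-donor pairs and the cross-donor pairs involving one tumor are flagged, which mirrors the mis-colored clusters described in panel a.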
Fig. 3
Relatedness plot for 1000 Genomes samples. Each dot represents a pair of samples. IBS0, on the x-axis, is the number of sites where one sample is homozygous for the reference allele and the other is homozygous for the alternate allele. IBS2, on the y-axis, is the count of sites where a pair of samples were both homozygous for the same allele or both heterozygous. Points with IBS0 of 0 are parent-child pairs. The 4 points with IBS0 > 0 and IBS0 < 450 are siblings. There are also several more distantly related sample pairs.
Fig. 4
Sex quality control on 1000 Genomes samples. Each point is a sample, colored orange if the sample is indicated as female and green if it is indicated as male; all data is for the X chromosome. a The number of homozygous alternate sites on the x-axis and the number of heterozygous sites on the y-axis. Males and females separate with few exceptions. b The number of homozygous alternate sites on the x-axis compared to the mean depth on the Y chromosome. Males and females reported in the manifest separate perfectly, indicating that some females may have experienced a complete loss of the X chromosome.
