Am J Hum Genet. 2013 Nov 7;93(5):840-51.
doi: 10.1016/j.ajhg.2013.09.014. Epub 2013 Oct 24.

Detecting identity by descent and estimating genotype error rates in sequence data

Brian L Browning et al. Am J Hum Genet. 2013.

Abstract

Existing methods for identity by descent (IBD) segment detection were designed for SNP array data, not sequence data. Sequence data have a much higher density of genetic variants and a different allele frequency distribution, and can have higher genotype error rates. Consequently, best practices for IBD detection in SNP array data do not necessarily carry over to sequence data. We present a method, IBDseq, for detecting IBD segments in sequence data and a method, SEQERR, for estimating genotype error rates at low-frequency variants by using detected IBD. The IBDseq method estimates probabilities of genotypes observed with error for each pair of individuals under IBD and non-IBD models. The ratio of estimated probabilities under the two models gives a LOD score for IBD. We evaluate several IBD detection methods that are fast enough for application to sequence data (IBDseq, Beagle Refined IBD, PLINK, and GERMLINE) under multiple parameter settings, and we show that IBDseq achieves high power and accuracy for IBD detection in sequence data. The SEQERR method estimates genotype error rates by comparing observed and expected rates of pairs of homozygote and heterozygote genotypes at low-frequency variants in IBD segments. We demonstrate the accuracy of SEQERR in simulated data, and we apply the method to estimate genotype error rates in sequence data from the UK10K and 1000 Genomes projects.
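The LOD score described in the abstract can be illustrated with a minimal sketch. It assumes the per-marker probabilities of the observed genotype pair under each model are already available and that markers are treated as independent, so the per-marker ratios multiply. The function name `lod_score` and its inputs are hypothetical; this is not the IBDseq likelihood model itself, which the abstract does not specify.

```python
import math

def lod_score(probs_ibd, probs_non_ibd):
    """Base-10 log of the likelihood ratio (IBD vs. non-IBD) over a set of markers.

    probs_ibd[i] and probs_non_ibd[i] are the probabilities of the observed
    genotype pair at marker i under the IBD and non-IBD models.
    Assumption: markers are independent, so per-marker log ratios add.
    """
    if len(probs_ibd) != len(probs_non_ibd):
        raise ValueError("probability lists must have the same length")
    return sum(math.log10(p_ibd / p_non)
               for p_ibd, p_non in zip(probs_ibd, probs_non_ibd))

# Two markers, each 10 times more likely under IBD than under non-IBD.
score = lod_score([0.1, 0.1], [0.01, 0.01])
```

A positive score favors the IBD model; a threshold on the score would then decide whether a segment is reported.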


Figures

Figure 1. IBD Detection Power and Accuracy with IBDseq. Power (proportion detected) is the average proportion of a true IBD segment of given length that overlaps with reported IBD segments. Accuracy (probability a segment is true) is the proportion of reported segments of given length for which there is a true segment that overlaps at least half of the reported segment. Results are binned by segment size: bins extend 0.05 cM on either side of the x axis value for x axis values ≤1 cM; 0.1 cM either side for x axis values ≤2 cM; and 0.5 cM either side for x axis values >2 cM.
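The power and accuracy definitions in the Figure 1 legend can be sketched for a single pair of individuals as follows. Segments are assumed to be (start, end) tuples in cM, and reported segments are merged before measuring coverage so that overlapping reports are not counted twice. All function names are hypothetical and the merging step is an assumption, not something the legend states.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def merge(segments):
    """Merge overlapping intervals so that coverage is counted once."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

def proportion_detected(true_seg, reported):
    """Fraction of one true segment covered by reported segments (power)."""
    covered = sum(overlap(true_seg, r) for r in merge(reported))
    return covered / (true_seg[1] - true_seg[0])

def is_true_report(reported_seg, true_segs):
    """True if some true segment overlaps at least half of the reported one."""
    half = 0.5 * (reported_seg[1] - reported_seg[0])
    return any(overlap(reported_seg, t) >= half for t in true_segs)

def bin_half_width(x_cm):
    """Half-width of the segment-size bin centered at x_cm, per the legend."""
    if x_cm <= 1.0:
        return 0.05
    if x_cm <= 2.0:
        return 0.1
    return 0.5
```

Averaging `proportion_detected` over the true segments in a size bin gives power, and the fraction of reported segments in a bin for which `is_true_report` holds gives accuracy.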
Figure 2. Power and Accuracy with IBDseq, Refined IBD, GERMLINE, and PLINK. See Figure 1 legend for definitions of axis labels.
Figure 3. Comparing IBD Detection across Methods. The value on the y axis (rate of detected IBD) is determined by finding for each reported IBD segment the length of the overlap between the reported IBD segment and the best-matching (defined in Material and Methods) true IBD segment. If no true IBD segment overlaps the reported IBD segment, the amount of overlap is zero. Detection rate is the sum of all such overlap lengths divided by the number of pairs of individuals analyzed and by the total length of the regions analyzed. The value on the x axis (rate of false-positive IBD) is the sum of the lengths of false reported IBD segments divided by the number of pairs of individuals analyzed and by the total length of the regions analyzed. A reported segment is considered to be false if there is no true IBD segment that overlaps at least half of the reported segment.
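The two rates in the Figure 3 legend can be sketched as below. The legend defers the definition of "best-matching" to Material and Methods, which is not reproduced here, so this sketch assumes the best-matching true segment is the one with the largest overlap. The function name and the (pair_id, start, end) record layout are hypothetical.

```python
from collections import defaultdict

def detection_and_false_positive_rates(reported, true, n_pairs, region_length):
    """Return (rate of detected IBD, rate of false-positive IBD).

    reported, true: lists of (pair_id, start, end) records.
    Assumption: the best-matching true segment is the one with largest overlap.
    """
    true_by_pair = defaultdict(list)
    for pair, start, end in true:
        true_by_pair[pair].append((start, end))

    detected_length = 0.0
    false_length = 0.0
    for pair, start, end in reported:
        overlaps = [max(0.0, min(end, t_end) - max(start, t_start))
                    for t_start, t_end in true_by_pair[pair]]
        best = max(overlaps, default=0.0)  # zero if nothing overlaps
        detected_length += best
        # False if no true segment covers at least half of the reported one.
        if best < 0.5 * (end - start):
            false_length += end - start

    denominator = n_pairs * region_length
    return detected_length / denominator, false_length / denominator

reported = [("a", 0.0, 4.0), ("a", 10.0, 12.0), ("b", 0.0, 2.0)]
true = [("a", 1.0, 6.0), ("b", 5.0, 8.0)]
det_rate, fp_rate = detection_and_false_positive_rates(reported, true, 2, 100.0)
```

Plotting one such pair of rates per method and parameter setting would give a curve of the kind the legend describes.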
Figure 4. Over- and Underestimation of IBD Segment Lengths. Differences between estimated and actual segment lengths were calculated for all reported IBD segments, and probability densities of these differences were estimated with a Gaussian kernel.
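Gaussian kernel density estimation of the length differences can be sketched in plain Python as follows. The legend does not state the bandwidth used, so the default here (the normal reference rule of thumb, 1.06 times the sample standard deviation times n to the power -1/5) is an assumption, and `gaussian_kde` is a hypothetical helper rather than the authors' code.

```python
import math
import statistics

def gaussian_kde(samples, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    if bandwidth is None:
        # Assumed default: normal reference rule of thumb.
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

# Differences (estimated minus actual length, in cM) for three toy segments.
density = gaussian_kde([-1.0, 0.0, 1.0], bandwidth=1.0)
```

Negative differences correspond to underestimated segment lengths and positive differences to overestimated ones.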
Figure 5. Genotype Error Estimation in Simulated Data. Estimated genotype error rates obtained from SEQERR are points; the solid line is the actual genotype error rate plotted against the observed error-added MAF. For the lowest MAFs each point is for a single minor allele count value; for higher MAFs several minor allele counts are combined to reduce noise.
Figure 6. Genotype Error Rate Estimation in the UK10K Sequence Data. The solid line shows the genotype error rate estimated by SEQERR and the dashed line shows half the average genotype discordance in 18 pairs of duplicate samples.
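The dashed-line quantity can be sketched as below. The factor of one half follows from a first-order argument: if each of two duplicate calls errs independently at a small rate e, the calls disagree at a rate of roughly 2e. The function name and the input layout (genotypes coded 0/1/2, with None for a missing call) are hypothetical.

```python
def half_average_discordance(duplicate_pairs):
    """Half the mean genotype discordance across pairs of duplicate samples.

    duplicate_pairs: list of (calls_a, calls_b), each a list of genotypes
    coded 0/1/2, with None marking a missing call (an assumed encoding).
    """
    rates = []
    for calls_a, calls_b in duplicate_pairs:
        compared = [(a, b) for a, b in zip(calls_a, calls_b)
                    if a is not None and b is not None]
        mismatches = sum(a != b for a, b in compared)
        rates.append(mismatches / len(compared))
    return 0.5 * sum(rates) / len(rates)

pairs = [
    ([0, 1, 0, 0], [0, 1, 1, 0]),        # 1 mismatch in 4 compared calls
    ([0, 0, None, 2], [0, 0, 1, 2]),     # 0 mismatches in 3 compared calls
]
estimate = half_average_discordance(pairs)
```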
Figure 7. Estimated False Call Rate for Called Heterozygote Genotypes at Low-Frequency Variants in the UK10K Data. The x axis is the observed minor allele count, shown on a log scale. The y axis is the estimated genotype error rate divided by twice the observed MAF.
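The y axis value of Figure 7 can be computed as in the sketch below. It assumes diploid samples, so the observed MAF is the minor allele count divided by twice the number of samples; at low-frequency variants twice the MAF is approximately the proportion of heterozygote genotypes, which is presumably why the ratio is read as a false call rate among called heterozygotes. The function name and parameters are hypothetical.

```python
def false_het_call_rate(error_rate, minor_allele_count, n_samples):
    """Estimated genotype error rate divided by twice the observed MAF.

    Assumption: diploid samples, so MAF = minor_allele_count / (2 * n_samples).
    """
    maf = minor_allele_count / (2 * n_samples)
    return error_rate / (2 * maf)

# Toy numbers: error rate 0.001 at a variant seen 10 times in 1000 samples.
rate = false_het_call_rate(0.001, 10, 1000)
```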

References

    1. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
    2. Browning B.L., Browning S.R. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 2013;194:459–471.
    3. Browning B.L., Browning S.R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 2011;88:173–182.
    4. Gusev A., Lowe J.K., Stoffel M., Daly M.J., Altshuler D., Breslow J.L., Friedman J.M., Pe’er I. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009;19:318–326.
    5. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575.