G3 (Bethesda). 2016 May 3;6(5):1287-96.
doi: 10.1534/g3.116.027581.

Conflation of Short Identity-by-Descent Segments Bias Their Inferred Length Distribution

Free PMC article

Charleston W K Chiang et al. G3 (Bethesda).

Abstract

Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. In a common definition, two haplotypes are said to share an IBD segment if that segment is inherited from a recent shared common ancestor without intervening recombination. Segments several cM long can be efficiently detected by a number of algorithms using high-density SNP array data from a population sample, and there are currently efforts to detect shorter segments from sequencing. Here, we study a problem of identifiability: because existing approaches detect IBD based on contiguous segments of identity-by-state, inferred long segments of IBD may arise from the conflation of smaller, nearby IBD segments. We quantified this effect using coalescent simulations, finding that significant proportions of inferred segments 1-2 cM long result from the conflation of two or more shorter segments, each at least 0.2 cM long, under demographic scenarios typical for modern humans and for all programs tested. The impact of such conflation is much smaller for longer (> 2 cM) segments. This biases the inferred IBD segment length distribution, and so can affect downstream inferences that depend on the assumption that each segment of IBD derives from a single common ancestor. As an example, we present and analyze an estimator of the de novo mutation rate using IBD segments, and demonstrate that unmodeled conflation leads to underestimates of the ages of the common ancestors on these segments, and hence a significant overestimate of the mutation rate. Understanding the conflation effect in detail will make its correction in future methods more tractable.
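
The identifiability problem can be illustrated with a minimal sketch. It assumes that the true IBD segments shared by one pair of individuals are given as (start, end) intervals in cM, and that a detector cannot separate two segments whose gap is at most `max_gap_cm`. The function name, the gap parameter, and the toy coordinates are illustrative assumptions; this is not the authors' simulation pipeline or the algorithm of any IBD caller.

```python
def conflate_segments(segments, max_gap_cm=0.0):
    """Merge (start, end) intervals in cM whose gap is <= max_gap_cm.

    Hypothetical helper: it mimics a detector that reports one long
    segment wherever nearby true segments cannot be told apart.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_cm:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy data (hypothetical): two adjacent 0.6 cM segments and one isolated 0.3 cM segment.
true_segments = [(10.0, 10.6), (10.6, 11.2), (15.0, 15.3)]
apparent = conflate_segments(true_segments, max_gap_cm=0.0)

true_lengths = [round(end - start, 3) for start, end in true_segments]
apparent_lengths = [round(end - start, 3) for start, end in apparent]
print(true_lengths)      # lengths of the underlying segments
print(apparent_lengths)  # lengths a conflating detector would report
```

In this toy case the two adjacent 0.6 cM segments are reported as a single 1.2 cM segment, so the apparent length distribution gains a long segment that does not trace to a single common ancestor.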

Keywords: coalescent; human genetics; identity-by-descent.

Figures

Figure 1
A schematic relating IBDARG and IBDcalled segments. The cartoon shows the alignment of four haplotypes belonging to two diploid individuals. Across the region multiple ARGs exist to relate the four haplotypes in a tree. Across the first IBDARG region (orange, IBDARG,1), the two middle haplotypes have recent, unchanging, local graphs for the entire region. Across the second IBDARG region (red, IBDARG,2), two different haplotypes have recent, unchanging, local graphs for the entire region. The two IBDARG segments happen to occur near each other such that the entire region may be detected by algorithms based on long stretches of sequence similarity (IBDcalled segment). ARG, ancestral recombination graph; IBD, identity-by-descent.
Figure 2
The prevalence of subsegments among algorithm-detected IBD segments. Each of a set of 250 randomly chosen IBDcalled segments detected by Refined IBD is represented by a vertical bar. The IBDcalled segments are sorted along the x-axis according to the detected length in decreasing order. For each IBDcalled segment, the longest intersecting IBDARG segment is shown in yellow. The second longest intersecting IBDARG subsegment, if present, is shown in blue. The overlap between the two longest IBDARG segments, if any, is shown in olive green. The remainder of the detected region is clumped in black. For each IBDcalled segment displayed, we also show the number of subsegments > 0.2 cM detected in simulation using the vertical axis on the right. For completeness, we also display cases where the second IBDARG segment is completely overlapping the longest IBDARG segment, in which case it would not confound the calling algorithm. See Figure S3 for results based on IBDcalled segments detected by GERMLINE, fastIBD, and IBDLD. ARG, ancestral recombination graph; IBD, identity-by-descent.
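
The decomposition shown in Figure 2 can be sketched as follows. The sketch assumes that the called segment and the true (ARG-derived) segments are available as (start, end) intervals in cM, and that the 0.2 cM threshold is applied to the part of each true segment lying inside the called region; that choice, the function names, and the toy coordinates are assumptions made for illustration.

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def two_longest_subsegments(called, true_segments, min_len_cm=0.2):
    """Clip true segments to the called interval and rank them by length.

    Returns (longest, second_longest, shared_length, n_subsegments), where
    only clipped subsegments longer than min_len_cm are counted.
    """
    clipped = []
    for seg in true_segments:
        lo, hi = max(seg[0], called[0]), min(seg[1], called[1])
        if hi - lo > min_len_cm:
            clipped.append((lo, hi))
    clipped.sort(key=lambda s: s[1] - s[0], reverse=True)
    longest = clipped[0] if clipped else None
    second = clipped[1] if len(clipped) > 1 else None
    shared = overlap(longest, second) if second else 0.0
    return longest, second, shared, len(clipped)


# Toy data (hypothetical): one called segment and three true segments
# from different haplotype pairs, so true segments may overlap.
called = (5.0, 6.5)
true_segments = [(4.75, 5.75), (5.5, 6.5), (6.0, 6.125)]
longest, second, shared, n = two_longest_subsegments(called, true_segments)
print(longest, second, shared, n)
```

Here the called 1.5 cM segment is explained by a 1.0 cM subsegment and a 0.75 cM subsegment that overlap by 0.25 cM; the 0.125 cM segment falls below the threshold and is not counted.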
Figure 3
The conflation effect as a function of the length of IBDcalled segments. (A) The complementary cumulative distribution functions (i.e., 1-CDF) for the total length extended due to subsegments > 0.2 cM (except for the longest IBDARG segment). The distributions are also stratified by four levels of detected length: 1–1.25 cM, 1.25–1.5 cM, 1.5–1.75 cM, and > 1.75 cM. The conflation effect is generally driven by segments < 1.75 cM in detected length. (B) The biases in estimated length due to subsegments and endpoint errors as a function of the estimated length. We binned all IBDcalled segments into seven bins: [1, 1.2), [1.2, 1.4), [1.4, 1.6), [1.6, 1.8), [1.8, 2), [2, 2.2), [2.2, 20), and for each bin examined the average length extended (from both ends) beyond the longest IBDARG segment found in the called region due to either a subsegment > 0.2 cM (blue), or other minor endpoint errors and gaps between subsegments (black). Each data point is plotted on the x-axis at the median length of the bin. ARG, ancestral recombination graph; CDF, cumulative distribution function; IBD, identity-by-descent.
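
The binning in panel (B) can be sketched as below. The bin edges come from the caption; everything else is an assumption for illustration: the record layout, the function name, and the simplification that the extension is the called length minus the length of the longest true segment, without splitting it into the subsegment and endpoint-error components that the figure separates.

```python
from bisect import bisect_right
from statistics import mean

# Half-open bins [1, 1.2), [1.2, 1.4), ..., [2.2, 20) in cM, as in the caption.
BIN_EDGES = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 20.0]


def mean_extension_by_bin(records):
    """Average extension beyond the longest true segment, per length bin.

    records: iterable of (called_len_cm, longest_true_len_cm) pairs.
    Returns one value per bin; None marks an empty bin.
    """
    per_bin = [[] for _ in range(len(BIN_EDGES) - 1)]
    for called_len, longest_len in records:
        if not (BIN_EDGES[0] <= called_len < BIN_EDGES[-1]):
            continue  # outside the binned range
        idx = bisect_right(BIN_EDGES, called_len) - 1
        per_bin[idx].append(called_len - longest_len)
    return [mean(values) if values else None for values in per_bin]


# Toy data (hypothetical): (called length, longest true segment length) in cM.
records = [(1.0, 0.75), (1.125, 0.625), (1.5, 1.25), (2.5, 2.5)]
extensions = mean_extension_by_bin(records)
print(extensions)
```

With these toy records, the shortest bin shows the largest average extension and the longest bin shows none, which is the qualitative pattern the caption describes for segments below and above about 2 cM.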
Figure 4
An illustrative example of the conflation effect on the mutation rate estimator. For a particular IBDcalled segment of length 1.145 cM, we show the distribution of IBDARG subsegment age (y-axis) as a function of position (x-axis, between 9.45–10.59 Mb of a 20 Mb simulated region). Each of the four different pairwise haplotype configurations between the two diploid samples is illustrated with a different color. The simulated haplotype numbers are displayed in the upper right-hand corner. The vertical dashed lines demarcate the 10% segment length from both ends of the segment that one could remove from analysis due to the uncertainty in estimating the ends of the IBDcalled segments. The age of each subsegment is plotted as a step function of its length. In this case, the IBD region is dominated by two long segments of IBD, one between simulated haplotypes 1865 and 654, another between simulated haplotypes 911 and 654. (There is actually a third, very short, segment of recent coalescence between simulated haplotypes 911 and 654 that is not obvious here.) Regions that do not produce long IBD segments can be clearly seen with the deep coalescences. In this case, the predominant IBD haplotype should be between haplotypes 1865 and 654, but the conflation with a neighboring IBD haplotype between haplotypes 911 and 654 led to the estimation of a single long IBD segment. ARG, ancestral recombination graph; IBD, identity-by-descent.
Figure 5
Conflations of shorter IBD segments will bias the length distribution. (A) For each bin of segment length range, we calculated the rate at which two IBDARG segments in our simulations, both within the length range, are adjacent and together constitute an end-to-end length of at least 1 cM (i.e., maximum gap sizes = 0 cM and combined length > 1 cM). The blue dot is the actual value observed in simulation. The boxplot shows the variance around the observed value by randomly sampling from the observed segment length distribution but randomly assigning the location of a segment and sample IDs 100 times. (B) The biased length distribution if each conflated IBDARG segment is counted for its conflated length rather than the two true lengths. Note that the apparent length of each conflated segment is due to conflation of two IBDARG segments, independent of any imprecision due to algorithm calling. Dotted line is the true length distribution if each conflated segment can be resolved based on the coalescent genealogy. Inset shows the comparison between the biased length distribution and the true length distribution in log scale. For results based on a maximum gap size of 0.01 cM, refer to Figure S6. ARG, ancestral recombination graph; IBD, identity-by-descent.
