Nucleic Acids Res. 41 (22), e202

HapFABIA: Identification of Very Short Segments of Identity by Descent Characterized by Rare Variants in Large Sequencing Data

Sepp Hochreiter. Nucleic Acids Res.

Abstract

Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority (152 000 IBD segments) are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.

Figures

Figure 1.
Biclustering of a genotyping matrix. Left: original genotyping data matrix, where rows give the individuals and columns consecutive SNVs. If at least one minor allele is present, then this is indicated by a violet bar for each individual–SNV pair, otherwise the bar is yellow. Right: after reordering the rows, a bicluster can be seen at the top three individuals. They contain the same IBD segment (in gold) and, therefore, are similar to each other by sharing minor alleles of SNVs within the segment (the tagSNVs).
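The row reordering described in this caption can be illustrated with a toy example. The sketch below is not the HapFABIA algorithm: it assumes the tagSNV columns are already known (HapFABIA infers them by biclustering), and the matrix values, the individual names and the threshold of three shared tagSNVs are all made up for illustration.

```python
# Toy genotype matrix: rows are individuals, columns are consecutive SNVs.
# 1 = at least one minor allele present, 0 = otherwise. All values are invented.
genotypes = {
    "ind_A": [0, 0, 1, 0, 0, 0, 0, 0],
    "ind_B": [0, 1, 0, 1, 1, 0, 1, 0],
    "ind_C": [0, 0, 0, 0, 0, 0, 0, 1],
    "ind_D": [0, 1, 0, 1, 1, 0, 1, 0],
    "ind_E": [1, 0, 0, 0, 0, 0, 0, 0],
    "ind_F": [0, 1, 0, 1, 0, 0, 1, 0],
}

# Hypothetical tagSNV columns of one IBD segment (assumed known here).
tag_snvs = [1, 3, 4, 6]


def shared_tag_count(row):
    """Count the tagSNV columns at which this individual carries a minor allele."""
    return sum(row[j] for j in tag_snvs)


# Reorder rows so that individuals carrying the most tagSNVs come first.
# sorted() is stable, so ties keep their original relative order.
ordered = sorted(
    genotypes,
    key=lambda name: shared_tag_count(genotypes[name]),
    reverse=True,
)

# Arbitrary illustrative threshold: treat individuals with >= 3 tagSNVs as carriers.
carriers = [name for name in ordered if shared_tag_count(genotypes[name]) >= 3]

for name in ordered:
    print(name, genotypes[name])
```

After sorting, the three carriers sit in adjacent rows at the top, which is the visual effect the right panel of the figure describes.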
Figure 2.
The outer product $\lambda z^T$ of vectors $\lambda$ and $z$. $\lambda$ indicates IBD segment tagSNVs and $z$ how many chromosomes of an individual contain the IBD segment. The row containing 2s indicates a homozygous region represented by $z_j = 2$ (two times the same IBD segment in individual j).
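A small numeric sketch may make the outer product concrete. The names `lam` (tagSNV indicator) and `z` (number of chromosomes per individual that carry the segment) and all values are assumptions chosen for illustration. The result is laid out with one row per individual, as the caption describes.

```python
# Hypothetical tagSNV indicator over six consecutive SNVs (1 = tagSNV).
lam = [0, 1, 0, 1, 1, 0]

# Hypothetical copy counts for five individuals:
# 0 = segment absent, 1 = on one chromosome, 2 = on both (homozygous).
z = [0, 1, 2, 0, 1]

# Outer product: entry (j, i) is z[j] * lam[i].
outer = [[z_j * lam_i for lam_i in lam] for z_j in z]

for j, row in enumerate(outer):
    print(f"individual {j}: {row}")
```

The individual with a count of 2 produces a row of 2s at the tagSNV positions, which corresponds to the homozygous region mentioned in the caption.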
Figure 3.
Evaluation of IBD detection methods. Each column is an SNV. The upper row shows a true IBD segment and the lower row a detected IBD segment. The middle row indicates TP, FP, TN and FN.
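The per-SNV comparison in this caption can be sketched as follows. The function name `confusion_counts` and the two example rows are hypothetical, and the sketch assumes that every SNV column is simply marked as inside (1) or outside (0) a segment.

```python
def confusion_counts(true_segment, detected_segment):
    """Tally TP, FP, TN and FN by comparing two 0/1 rows column by column."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, detected in zip(true_segment, detected_segment):
        if truth and detected:
            counts["TP"] += 1  # SNV in both the true and the detected segment
        elif detected:
            counts["FP"] += 1  # SNV only in the detected segment
        elif truth:
            counts["FN"] += 1  # SNV only in the true segment
        else:
            counts["TN"] += 1  # SNV in neither segment
    return counts


# Invented example: the detected segment is shifted to the right of the true one.
true_segment = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
detected_segment = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

counts = confusion_counts(true_segment, detected_segment)
print(counts)
```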
Figure 4.
For each IBD segment, the population with maximum proportion is determined. IBD segments are given for each matching genome, where the color indicates the population that has maximum proportion. For the human genome, 8000 random IBD segments are chosen. Almost half of the Neandertal-matching IBD segments have ASN or EUR as the population with maximum proportion. The Archaic genome (Neandertal and Denisovan) also shows an enrichment of IBD segments that are found mostly in ASN or EUR.
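One way to read "population with maximum proportion" is sketched below. The caption does not define how the proportion is computed, so this example assumes it is the fraction of a segment's carriers that belong to each population. The function name and the carrier labels are hypothetical.

```python
from collections import Counter


def max_proportion_population(carrier_populations):
    """Return the population with the largest share of carriers, plus all shares.

    Assumption: proportion = carriers from a population / all carriers of the
    segment. On ties, the population encountered first wins.
    """
    counts = Counter(carrier_populations)
    total = sum(counts.values())
    proportions = {pop: n / total for pop, n in counts.items()}
    best = max(proportions, key=proportions.get)
    return best, proportions


# Invented example: population labels of five carriers of one IBD segment.
segment_carriers = ["AFR", "AFR", "ASN", "AFR", "EUR"]

best_pop, props = max_proportion_population(segment_carriers)
print(best_pop, props)
```

If the proportions were instead normalized by the sample size of each population, only the computation of `proportions` would change.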
Figure 5.
For each genome and each IBD segment, the color indicates whether a population contains this segment (‘With’) or not (‘Without’). For the human genome, 8000 random IBD segments are chosen. IBD segments that match the Neandertal or the Archaic genome are found more often in ASN and EUR than all IBD segments (human). This effect is not as prominent for IBD segments that match the Denisovan genome.
Figure 6.
Density of lengths of IBD segments that are private to ASN versus density of IBD segment lengths shared only by ASN and AFR. The global peak for ASN is at 25 800 bp (red dashed line), while the global peak for AFR-ASN is at 22 000 bp (blue dashed line). Interestingly, the higher density between 3000 and 10 000 bp (blue area) indicates that the African-Asian IBD segments are older.
Figure 7.
Densities of lengths of IBD segments that match the Denisova genome and are private to AFR versus IBD segments that are not observed in AFR. The peak for AFR is at 10 000 bp, while IBD segment lengths that are not observed in AFR have peaks at 20 000 and 28 000 bp.
Figure 8.
Densities of lengths of IBD segments that match the Neandertal genome and are enriched in a particular population. The dashed lines indicate the density peaks at 17 000 bp for AFR, 25 800 bp for ASN and 24 000 bp for EUR. Further, a smaller peak for both EUR and ASN is visible at 42 000 bp.
Figure 9.
Example of an IBD segment matching the Denisova genome shared exclusively among ASN. The data analyzed by HapFABIA were phased genotypes from chromosome 1 of the 1000 Genomes Project. The rows give all chromosomes that contain the IBD segment and columns consecutive SNVs. If both chromosomes of an individual contain the IBD segment, then two adjacent identical row labels are present. Major alleles are shown in yellow, minor alleles of tagSNVs in violet and minor alleles of other SNVs in cyan. The row labeled ‘model L’ indicates tagSNVs identified by HapFABIA in violet. The rows ‘Ancestor’, ‘Neandertal’ and ‘Denisova’ show bases of the respective genomes in violet if they match the minor allele of the tagSNVs (in yellow otherwise). Neandertal tagSNV bases that are not called are shown in orange.


