Nat Commun, 10 (1), 1784

Multi-platform Discovery of Haplotype-Resolved Structural Variation in Human Genomes

Mark J P Chaisson  1   2 Ashley D Sanders  3 Xuefang Zhao  4   5 Ankit Malhotra  6 David Porubsky  7   8 Tobias Rausch  3 Eugene J Gardner  9 Oscar L Rodriguez  10 Li Guo  11   12   13 Ryan L Collins  5   14 Xian Fan  15 Jia Wen  16 Robert E Handsaker  17   18   19 Susan Fairley  20 Zev N Kronenberg  1 Xiangmeng Kong  21   22 Fereydoun Hormozdiari  23   24 Dillon Lee  25 Aaron M Wenger  26 Alex R Hastie  27 Danny Antaki  28 Thomas Anantharaman  27 Peter A Audano  1 Harrison Brand  5 Stuart Cantsilieris  1 Han Cao  27 Eliza Cerveira  6 Chong Chen  15 Xintong Chen  9 Chen-Shan Chin  26 Zechen Chong  15 Nelson T Chuang  9 Christine C Lambert  26 Deanna M Church  29 Laura Clarke  20 Andrew Farrell  25 Joey Flores  30 Timur Galeev  21   22 David U Gorkin  31   32 Madhusudan Gujral  28 Victor Guryev  7 William Haynes Heaton  29 Jonas Korlach  26 Sushant Kumar  21   22 Jee Young Kwon  6   33 Ernest T Lam  27 Jong Eun Lee  34 Joyce Lee  27 Wan-Ping Lee  6 Sau Peng Lee  35 Shantao Li  21   22 Patrick Marks  29 Karine Viaud-Martinez  30 Sascha Meiers  3 Katherine M Munson  1 Fabio C P Navarro  21   22 Bradley J Nelson  1 Conor Nodzak  16 Amina Noor  28 Sofia Kyriazopoulou-Panagiotopoulou  29 Andy W C Pang  27 Yunjiang Qiu  32   36 Gabriel Rosanio  28 Mallory Ryan  6 Adrian Stütz  3 Diana C J Spierings  7 Alistair Ward  25 AnneMarie E Welch  1 Ming Xiao  37 Wei Xu  29 Chengsheng Zhang  6 Qihui Zhu  6 Xiangqun Zheng-Bradley  20 Ernesto Lowy  20 Sergei Yakneen  3 Steven McCarroll  17   18   19 Goo Jun  38 Li Ding  39 Chong Lek Koh  40 Bing Ren  31   32 Paul Flicek  20 Ken Chen  15 Mark B Gerstein  21   22   41   42 Pui-Yan Kwok  43 Peter M Lansdorp  7   44   45 Gabor T Marth  25 Jonathan Sebat  28   31   46 Xinghua Shi  16 Ali Bashir  10 Kai Ye  12   13   47 Scott E Devine  9 Michael E Talkowski  5   19   48 Ryan E Mills  4   49 Tobias Marschall  8 Jan O Korbel  50   51 Evan E Eichler  52   53 Charles Lee  54   55

Mark J P Chaisson et al. Nat Commun.

Abstract

The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome, 58 of which intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three- to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community, allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

Conflict of interest statement

J.K., C.-S.C., C.C.L., and A.M.W. are employees and shareholders of Pacific Biosciences (aka PacBio); A.R.H., T.A., H.C., E.T.L., J.L., and A.W.C.P. are employees and shareholders of Bionano Genomics; D.M.C., W.H.H., P.M., S.K.-P., and W.X. are employees and shareholders of 10X Genomics; J.F. is an employee of Illumina; J.E.L. is an employee of DNALink; S.P.L. is an employee of TreeCode Sdn Bhd. P.F. is a member of the scientific advisory board (SAB) of Fabric Genomics, Inc., and Eagle Genomics, Ltd. E.E.E. is on the SAB of DNAnexus, Inc. and was a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program (2014–2016). C.L. was on the SAB of Bionano Genomics. All other authors declare no competing interests.

Figures

Fig. 1
Characteristics of SNV-based haplotypes obtained from different data sources. a Distribution of phased block lengths for the YRI child NA19240. Note that Strand-seq haplotypes span whole chromosomes and therefore one block per chromosome is shown. Vertical bars highlight N50 haplotype length: the minimum length haplotype block at which at least half of the phased bases are contained. For Illumina (IL) paired-end data, phased blocks cover <50% of the genome and hence the N50 cannot be computed. b Fraction of phase connection, i.e., pairs of consecutive heterozygous variants provided by each technology (averaged over all proband samples). c Pairwise comparisons of different phasings; colors encode switch error rates (averaged over all proband samples). For each row, a green box indicates the phasing of an independent technology with best agreement, with corresponding switch error rates given in green. d Each phased block is shown in a different color. The largest block is shown in cyan, i.e., all cyan regions belong to one block, even though interspaced by white areas (genomic regions where no variants are phased) or disconnected small blocks (different colors). e Fraction of heterozygous SNVs in the largest block shown in d. f Mismatch error rate of largest block compared to trio-based phasing, averaged over all chromosomes of all proband genomes (i.e., the empirical probability that any two heterozygous variants on a chromosome are phased correctly with respect to each other, in contrast to the switch error rate, which relays the probability that any two adjacent heterozygous variants are phased correctly). (*) Not available because trio phasing is used as reference for comparisons. (**) Not shown as population-based phasing does not output block boundaries; refer to Supplementary Material for an illustration of errors in population-based phasing
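The two summary statistics defined in the Fig. 1 caption, N50 haplotype length (panel a) and the contrast between switch error and mismatch error (panel f), can be illustrated with a minimal Python sketch. The function names and the 0/1 encoding of haplotype assignments are assumptions made for illustration, and the all-pairs computation is a literal reading of the caption's wording, not necessarily the exact procedure used by the authors.

```python
from itertools import combinations


def n50(block_lengths):
    """Length of the block at which the cumulative length, counted from the
    largest block downwards, first reaches half of all phased bases."""
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one phased block is required")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


def switch_error_rate(test, truth):
    """Fraction of ADJACENT heterozygous variant pairs whose relative phase
    differs between the two phasings (each a list of 0/1 haplotype labels)."""
    pairs = range(len(truth) - 1)
    errors = sum(
        (test[i] ^ test[i + 1]) != (truth[i] ^ truth[i + 1]) for i in pairs
    )
    return errors / len(pairs)


def pairwise_mismatch_rate(test, truth):
    """Fraction of ALL heterozygous variant pairs in a block whose relative
    phase differs between the two phasings (hypothetical helper)."""
    pairs = list(combinations(range(len(truth)), 2))
    errors = sum(
        (test[i] ^ test[j]) != (truth[i] ^ truth[j]) for i, j in pairs
    )
    return errors / len(pairs)


# Toy data: one switch in the middle of a six-variant block.
truth = [0, 0, 0, 0, 0, 0]
test = [0, 0, 0, 1, 1, 1]
print(n50([10, 20, 30, 40]))
print(switch_error_rate(test, truth))
print(pairwise_mismatch_rate(test, truth))
```

In this toy example a single switch in the middle of the block gives a switch error rate of 0.2 but an all-pairs mismatch rate of 0.6, which shows why the two measures are reported separately.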
Fig. 2
Comparison and integration of indel and SV callsets on HG00733, HG00514, and NA19240. a Length distribution of deletions and insertions identified by PB (blue), IL (red) and BNG (brown), respectively, together with averaged length distribution of SVs discovered in the maternal genomes by the 1KG-P3 report (silver). b Number of SVs discovered by one or multiple genome platforms in the YRI child NA19240. c Overlap of IL indel discovery algorithms, with total number of indels found by each combination of IL algorithms (gray) and those that overlapped with a PB indel (blue) in the CHS child HG00514
Fig. 3
Characterization of simple and complex inversions. a Integration of inversions across platforms based on reciprocal overlap. Shown is an example of five orthogonal platforms intersecting at a homozygous inversion, with breakpoint ranges and supporting Strand-seq signature illustrated in bottom panels. b Size distribution of inversions included in the unified inversion list, subdivided by technology, with the total inversions (N) contributed by each listed. c Classification of Strand-seq inversions based on orthogonal phase support. Illustrative examples of simple (homozygous and heterozygous) and complex (inverted duplication) events are shown. Strand-seq inversions were identified based on read directionality (read count; reference reads in gray, inverted reads in purple), the relative ratio of reference to inverted reads within the locus (read ratio), and the haplotype structure of the inversion, with phased read data considered in terms of directionality (Ph; H1 alleles in red, H2 alleles in blue; alleles from reference reads are displayed above the ideogram and alleles from inverted reads are displayed below). ILL Illumina. liWGS long-insert whole-genome sequencing libraries. PB Pacific Biosciences. StS Strand-seq. BNG Bionano Genomics. SD segmental duplication. Ph phase data
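The reciprocal-overlap criterion mentioned in panel a of Fig. 3 can be sketched as follows. This is a minimal illustration only: the half-open coordinate convention, the function names, and the 50% default threshold are assumptions, since the caption does not state the threshold or the coordinate system used for the unified inversion list.

```python
def overlap_length(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def reciprocal_overlap(a, b, min_fraction=0.5):
    """True if the shared bases cover at least `min_fraction` of BOTH calls.

    `min_fraction` is an assumed default, not a value taken from the caption.
    """
    shared = overlap_length(a, b)
    len_a = a[1] - a[0]
    len_b = b[1] - b[0]
    if len_a <= 0 or len_b <= 0:
        raise ValueError("intervals must have positive length")
    return shared / len_a >= min_fraction and shared / len_b >= min_fraction


# Two hypothetical inversion calls from different platforms.
print(reciprocal_overlap((100, 200), (150, 250)))  # shares 50 of 100 bases each
print(reciprocal_overlap((100, 200), (190, 400)))  # shares only 10 bases
```

Requiring the overlap fraction to hold for both calls prevents a small call nested inside a much larger one from being merged with it, which a one-sided overlap test would allow.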
Fig. 4
Concordance of IL methods compared against total IL callset and PB callset using orthogonal technologies. Results by algorithm shown for a the deletion concordance for individual methods, b the union of all pairs of methods, and c the requirement that more than one caller agree on any call. Individual callers are shown as red points for comparison. Pairs and triples of combinations are in black points. The solid and dashed lines represent the 5% and 10% non-concordance rates (NCR), respectively. The top five combinations of methods in each plot below the 10% NCR, along with the individual plots, are labeled. d Overlap of IL-SV discovery algorithms, with total number of SVs found by each combination of IL algorithms (gray) and those that overlapped with the PB-SV calls (blue) in the YRI child NA19240. e PCA of the genotypes of concordant calls of each method: PC 1 versus 2 (left), PC 2 versus 3 (right). VH VariationHunter
