Nat Genet. 2011 Mar;43(3):269-76.
doi: 10.1038/ng.768. Epub 2011 Feb 13.

Discovery and Genotyping of Genome Structural Polymorphism by Sequencing on a Population Scale

Free PMC article

Robert E Handsaker et al. Nat Genet. 2011.

Abstract

Accurate and complete analysis of genome variation in large populations will be required to understand the role of genome variation in complex disease. We present an analytical framework for characterizing genome deletion polymorphism in populations using sequence data that are distributed across hundreds or thousands of genomes. Our approach uses population-level concepts to reinterpret the technical features of sequence data that often reflect structural variation. In the 1000 Genomes Project pilot, this approach identified deletion polymorphism across 168 genomes (sequenced at 4× average coverage) with sensitivity and specificity unmatched by other algorithms. We also describe a way to determine the allelic state or genotype of each deletion polymorphism in each genome; the 1000 Genomes Project used this approach to type 13,826 deletion polymorphisms (48–995,664 bp) at high accuracy in populations. These methods offer a way to relate genome structural polymorphism to complex disease in populations.

Figures

Figure 1
A population-aware analytical framework for analyzing Genome STRucture in Populations (Genome STRiP). (a) Population-scale sequence data contain two classes of information: technical features of the sequence data within a genome, and population-scale patterns that span all the genomes analyzed. Technical features include breakpoint-spanning reads, paired-end sequences, and local variation in read depth of coverage. Genome STRiP combines these with population-scale patterns that span many genomes, including: the sharing of structural alleles by multiple genomes; the pattern of sequence heterogeneity within a population; the substitution of alternative structural alleles for each other; and the haplotype structure of human genome polymorphism. (b) Goals of structural variation (SV) analysis in Genome STRiP. Variation discovery involves identifying the structural alleles that are segregating in a population. The power to observe a variant in any one genome is only partial, but the evidence defining a segregating site can be derived from many genomes at once. Population genotyping requires accurately determining the allelic state of each variant in every diploid genome in a population.
Figure 2
Identifying coherent sets of aberrantly mapping reads from a population of genomes. (a) Millions of end-sequence pairs from sequencing libraries show aberrant alignment locations, appearing to span vast genomic distances. Almost all of these observations derive not from true structural variants but from chimeric inserts in molecular sequencing libraries. Data shown: paired-end alignments on chromosome 5, from 41 initial genome sequencing libraries from the 1000 Genomes Project. (b) A set of “coherently aberrant” end-sequence pairs from many genomes. At this genomic locus, paired-end sequences (sequences of the two ends of the inserts in a molecular library) fall into two classes: (i) end-sequence pairs that show the genomic spacing expected given the insert size distribution of each sequencing library, such as the three read-pair alignments for genome NA07037; and (ii) end-sequence pairs that align to genomic locations unexpectedly far apart, but which relate to their expected insert size distributions by a shared correction factor (red arrows). A unifying model in which these eight read pairs from five genomes arise from a shared deletion allele (size of red arrows) converts all of these aberrant read pairs to likely observations. (In right panel, black tick marks indicate genomic distance between left and right end sequences; black curves indicate insert size distributions of the molecular library from which each sequence-pair is drawn.)
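The "shared correction factor" idea in panel (b) can be illustrated with a minimal sketch. It assumes each read pair is summarized by its observed genomic span and the mean insert size of its library, and it treats a set of pairs as coherent when their implied deletion sizes agree within a fixed tolerance. The function names, the tolerance, and the example numbers are hypothetical and are not taken from Genome STRiP.

```python
from statistics import median


def implied_deletion_sizes(pairs):
    """Return span minus library mean insert size for each read pair.

    pairs: list of (observed_span_bp, library_mean_insert_bp) tuples.
    """
    return [span - mean_insert for span, mean_insert in pairs]


def is_coherent(pairs, tolerance_bp=150):
    """True if every implied deletion size lies within tolerance_bp of the median.

    tolerance_bp is an illustrative assumption standing in for the spread
    of the library insert-size distributions.
    """
    sizes = implied_deletion_sizes(pairs)
    center = median(sizes)
    return all(abs(size - center) <= tolerance_bp for size in sizes)


# Hypothetical read pairs from libraries with different insert sizes.
coherent_pairs = [(2300, 200), (2450, 350), (2280, 180), (2520, 400)]
# Hypothetical chimeric pairs: spans share no common correction.
chimeric_pairs = [(2300, 200), (91000, 350), (15400, 180)]
```

Pairs from libraries with different insert sizes have different raw spans, so the comparison is made on the implied deletion size rather than on the span itself.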
Figure 3
Evaluating the population-heterogeneity and allele-substitution properties of population-scale sequence data. (a) At a candidate deletion locus, the distribution across genomes of “evidentiary reads” (read pairs suggesting the presence of a deletion allele at a locus) (blue bars) is compared to a null model under which genomes are equally likely, per molecule sequenced, to give rise to such evidentiary reads (green curve). For the locus shown, the distribution of evidentiary reads across genomes differs from the null distribution (p = 1 × 10⁻⁴), confirming that evidentiary sequence data appear differentially within the population at this locus. (b) At another genomic locus, putative SV-supporting read pairs arise from many genomes but in a pattern that does not significantly differ from a null distribution based on equal probability per molecule sequenced. Subsequent assays confirmed that this is not a true deletion. (c) Distribution of a population-heterogeneity statistic (from a,b) for read-pair data at 1,420 sites of known deletion polymorphism. (d) Distribution of the same population-heterogeneity statistic from read-pair data at 45,000 candidate deletion loci nominated by read-pair analysis. (e,f) If a putative deletion is real, then genomes with molecular evidence for the deletion allele would be expected to have less evidence for the reference allele (“allelic substitution”). A simple test of allelic substitution is to compare average read depth (across a putative deletion segment) between two subpopulations: the genomes with read-pair evidence for the deletion (blue curve), and the genomes lacking such evidence (black trace). The locus in (e) was subsequently validated as containing a real deletion; the locus in (f) was not. (g) Distribution of this “subpopulation depth ratio” statistic (e,f) for sequence data at 1,420 sites of known deletion polymorphism. (h) Distribution of the same statistic for sequence data at 45,000 candidate deletion loci.
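The two tests described in this caption can be sketched as follows. The caption does not specify the exact heterogeneity statistic, so this sketch substitutes a plain chi-square statistic against the "equal probability per molecule sequenced" null. The depth ratio is computed as mean depth in genomes with evidence divided by mean depth in genomes without. All function and parameter names are hypothetical.

```python
def heterogeneity_chi_square(evidence_counts, molecules_sequenced):
    """Chi-square statistic comparing evidentiary read counts per genome
    with the counts expected if evidence were proportional to the number
    of molecules sequenced in each genome (the null model).

    Assumes every genome has a nonzero number of molecules sequenced.
    """
    total_evidence = sum(evidence_counts)
    total_molecules = sum(molecules_sequenced)
    statistic = 0.0
    for observed, molecules in zip(evidence_counts, molecules_sequenced):
        expected = total_evidence * molecules / total_molecules
        statistic += (observed - expected) ** 2 / expected
    return statistic


def subpopulation_depth_ratio(depths, has_evidence):
    """Mean read depth over the candidate segment in genomes with read-pair
    evidence, divided by the mean in genomes without such evidence.

    Assumes both subpopulations are non-empty.
    """
    with_evidence = [d for d, e in zip(depths, has_evidence) if e]
    without_evidence = [d for d, e in zip(depths, has_evidence) if not e]
    mean_with = sum(with_evidence) / len(with_evidence)
    mean_without = sum(without_evidence) / len(without_evidence)
    return mean_with / mean_without
```

Under these assumptions, a real deletion would tend to give a large heterogeneity statistic (evidence concentrated in carriers) and a depth ratio below 1, while an artifact would tend to give a small statistic and a ratio near 1.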
Figure 4
Deletion polymorphisms identified by Genome STRiP in low-coverage sequence data from 168 genomes. (a) Size distribution. Sensitivity for large deletions (>10 kb) is similar to that of the array-based approaches applied in large, population-scale studies (red); sensitivity for deletions smaller than 10 kb is much greater. A strong peak near 300 bp arises from Alu insertion polymorphisms; a smaller peak near 6 kb arises from L1 insertion polymorphisms. (b,c) Number of evidentiary sequence reads (b) and genomes (c) contributing to each deletion discovery in population-scale sequence data. 1,033 of these deletions (14.7%) were identified with evidentiary pairs from individual genomes. (d) Specificity: false discovery rates of ten deletion discovery methods evaluated by the 1000 Genomes Project in the Project’s population-scale low-coverage sequence data. (e) Sensitivity: power of the same ten discovery methods for identifying known deletions, as a function of the allele frequency of the deletion. (f) Localization of the breakpoints of a common deletion allele using read-pair data from many genomes. The difference between (i) the genomic separation of each read-pair sequence and (ii) the insert-size distribution of the molecular library from which it is drawn (Fig. 2b) allows a likelihood-based estimate of deletion length from each read pair (blue curves). Combining this likelihood information across many genomes (black curve) allows fine-scale localization of the breakpoint. (g) Resolution of breakpoint estimates from Genome STRiP, as estimated using Genome STRiP confidence intervals (red) and comparison to molecularly established breakpoint sequences (blue). (h) Fine-scale localization of an SV breakpoint facilitates directed local assembly of the deletion allele from sequence data derived from many genomes.
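The likelihood combination in panel (f) can be sketched under a simplifying assumption: each library's insert-size distribution is modeled as a Gaussian with known mean and standard deviation, and read pairs are treated as independent. Real libraries have empirical, non-Gaussian distributions, so this is an illustration of the idea, not the Genome STRiP implementation. All names and numbers are hypothetical.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def combined_log_likelihood(deletion_len, pairs):
    """Sum of per-pair log-likelihoods for a candidate deletion length.

    pairs: list of (observed_span_bp, library_mean_bp, library_sd_bp).
    If the deletion is real, span minus deletion length should look like
    an ordinary insert from that pair's library.
    """
    return sum(
        log_normal_pdf(span - deletion_len, mean, sd) for span, mean, sd in pairs
    )


def estimate_deletion_length(pairs, candidates):
    """Return the candidate length with the highest combined log-likelihood."""
    return max(candidates, key=lambda d: combined_log_likelihood(d, pairs))


# Hypothetical read pairs from three libraries with different insert sizes.
example_pairs = [(2300, 200, 30), (2460, 350, 40), (2270, 180, 25)]
estimate = estimate_deletion_length(example_pairs, range(2000, 2201))
```

Because log-likelihoods add, pairs from tighter libraries (smaller standard deviation) pull the estimate harder, which is one reason pooling pairs across many genomes narrows the breakpoint interval.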
Figure 5
Determining the allelic state (genotype) of 13,826 deletions in 156 genomes. (a) Four of the 13,826 deletion polymorphisms analyzed, representing diverse properties in terms of size and alignability of the affected sequence. Grey vertical rectangles indicate sequence that is repeat-masked or otherwise non-alignable. The locus in the bottom row is an Alu insertion polymorphism. (b) Population-scale distribution of read depth across genomes, at each of the deletion loci in (a). For each locus, normalized measurements of read depth (across the deleted segment) from 156 genomes are fitted to a Gaussian mixture model. Colored squares represent genomes for which genotype could be called at 95% confidence based on read depth. (c) Genotype likelihood from read depth. Each horizontal stripe (corresponding to one of the 156 genomes) is divided into three sections with length proportional to the estimated relative likelihood of the sequence data given each genotype model (blue: copy-number 2; green: copy-number 1; orange: copy-number 0). (d) Genotype likelihood based on evidence from read pairs (RP) and breakpoint-spanning reads (BR). At the third locus from top, the absence of an established breakpoint sequence limits inference to read pairs. (e) Genotype likelihood based on integrating evidence from read depth (RD), read pairs (RP) and breakpoint-spanning reads (BR). (f) Genotype likelihood based on integrating evidence from (c–e) with flanking SNP data in a population haplotype model. (g) Population-scale sequence data at each locus, as resolved into genotype classes. Traces indicate average read depth for genomes of each inferred genotype. Orange and green rectangles indicate evidentiary read pairs and breakpoint-spanning reads, colored by the genotype determination for the genome from which they arise.
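Panels (b) through (e) can be illustrated with a minimal sketch of per-genome genotype likelihoods. It assumes fixed Gaussian components for copy numbers 0, 1 and 2 at normalized depths 0, 0.5 and 1.0 with a shared standard deviation, whereas the caption describes a mixture fitted to the data at each locus. It also assumes evidence sources are independent, so their likelihoods multiply. The 95% threshold follows the caption; every function name and parameter value is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))


def depth_genotype_likelihoods(normalized_depth, sd=0.15):
    """Likelihood of the observed depth under copy numbers 0, 1 and 2.

    Component means and sd are fixed assumptions here, not fitted values.
    """
    means = {0: 0.0, 1: 0.5, 2: 1.0}
    return {cn: normal_pdf(normalized_depth, m, sd) for cn, m in means.items()}


def combine(*likelihoods):
    """Multiply likelihoods from independent evidence sources and
    rescale so the three genotype values sum to 1."""
    combined = {cn: 1.0 for cn in (0, 1, 2)}
    for source in likelihoods:
        for cn in combined:
            combined[cn] *= source[cn]
    total = sum(combined.values())
    return {cn: value / total for cn, value in combined.items()}


def call_genotype(relative, threshold=0.95):
    """Return the copy number whose relative likelihood meets the
    threshold, or None if no genotype is confident enough."""
    best = max(relative, key=relative.get)
    return best if relative[best] >= threshold else None


# Depth alone is ambiguous at 0.75; hypothetical read-pair evidence resolves it.
depth_only = combine(depth_genotype_likelihoods(0.75))
read_pair_evidence = {0: 0.01, 1: 0.98, 2: 0.01}
integrated = combine(depth_genotype_likelihoods(0.75), read_pair_evidence)
```

In this example a genome whose depth falls midway between two components gets no call from depth alone, but the same genome becomes callable once read-pair evidence is multiplied in, which mirrors the progression from panel (c) to panel (e).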
