Science. 2016 Apr 1;352(6281):aae0344.
doi: 10.1126/science.aae0344.

Long-read Sequence Assembly of the Gorilla Genome

Free PMC article

David Gordon et al. Science. 2016.

Abstract

Accurate sequence and assembly of genomes is a critical first step for studies of genetic variation. We generated a high-quality assembly of the gorilla genome using single-molecule, real-time sequence technology and a string graph de novo assembly algorithm. The new assembly improves contiguity by two to three orders of magnitude with respect to previously released assemblies, recovering 87% of missing reference exons and incomplete gene models. Although regions of large, high-identity segmental duplications remain largely unresolved, this comprehensive assembly provides new biological insight into genetic diversity, structural variation, gene loss, and representation of repeat structures within the gorilla genome. The approach provides a path forward for the routine assembly of mammalian genomes at a level approaching that of the current quality of the human genome.

Figures

Fig. 1. Gorilla genome assembly
(A) Schematic depicting assembly contig lengths (contig N50 = 9.6 Mbp) mapped to human GRCh38 chromosomes. The first two rows of black rectangles represent contigs >3 Mbp, the blue rectangles correspond to contigs ≤3 Mbp, and red rectangles correspond to blocks of human/gorilla segmental duplications >100 kbp. (B) Mappability and satellite content of Susie3 contigs. Satellite content defined by use of RepeatMasker (28) and Tandem Repeats Finder (29). Contigs that are unable to map to GRCh38 by using BLASR (colored red) (30) contain a high fraction of satellite sequence. (C) Length distribution of gaps in the published gorilla assembly gorGor3 closed by Susie3 and containing exons or regulatory regions. Of the gaps in gorGor3, 94% were closed in Susie3, with thousands corresponding to missing exons (red) and putative noncoding regulatory DNA (blue).
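The contig N50 quoted in panel (A) is the length L such that contigs of length L or longer account for at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions, not taken from the paper or its pipeline.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches half of
    the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from Susie3.
toy_lengths = [2, 3, 4, 5, 6]
print(contig_n50(toy_lengths))
```

Because N50 depends only on the sorted length distribution, it measures contiguity and says nothing about whether the contigs are correctly assembled.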
Fig. 2. Gorilla genome ideogram
Schematic depicting assembly contig lengths mapped to gorilla chromosomes. The first two rows of black rectangles represent contigs >3 Mbp, the green rectangles correspond to contigs >1 Mbp and ≤3 Mbp, and blue rectangles correspond to contigs ≤1 Mbp.
Fig. 3. Comparison of gorilla genome assemblies
The contig length distribution for the resulting long-read assembly (Susie3) is 2 to 3 orders of magnitude larger when compared with previous gorilla genome assemblies (gorGor3 and gorGor4) that were generated by using Illumina and Sanger sequencing technology.
Fig. 4. Gene annotation and structural variation
(A) Proportion of GENCODE transcripts with assembly errors when aligned with gorilla assemblies Susie3 and gorGor3, and three reference assemblies, including orangutan (ponAbe2), chimpanzee (panTro4), and squirrel monkey (saiBol1). Examples of assembly errors include transcript mappings extending off the end of contigs/scaffolds, containing unknown bases, or incomplete transcript mapping. (B) An example of a gene, otoancorin (OTOA), with complete exon representation (red ticks) resolved in the new assembly. Red bars on gorGor3 sequence indicate gaps in the assembly. Alignments between gorilla assemblies are based on Miropeats (31). (C) Alignment of MHC Class II locus in Susie3 against GRCh37 with Miropeats. Alignment identities of collinear blocks between assemblies are shown above the corresponding GRCh37 sequence. Repeats internal to Susie3 are shown in red along the coordinates. Alignment identity across the entire locus is shown below the Susie3 contigs in 5-kbp windows (1 kbp sliding). Support for the proper organization of the Susie3 sequence is shown by the tiling path of concordant BAC end sequences from the Kamilah BAC library (CHORI-277). (D) A sequence-resolved complex gorilla genome structural variation orthologous to human chromosome 19:38,867,213–39,866,620 (GRCh38). The dot-matrix plot shows a 125,375-bp inversion flanked by a proximal 16-kbp deletion and 8-kbp insertion, and a 23-kbp distal deletion. The deletions remove the entire sequences of the SELV and CLC genes in gorilla when compared with human.
Fig. 5. Improved mobile element resolution
(Left) PTERV1 and SVA insertion length and percent identity distributions in Susie3 (blue) and gorGor3 (red). The PTERV1 and SVA elements in gorGor3 are biased toward short but on average higher identity alignments to the consensus sequence because the more divergent long terminal repeat sequences are not resolved. (Right) Mean insertion lengths for gorGor3 and Susie3, respectively, are 2194.93 and 7565.85 for PTERV1 (medians 1223 and 7725) and 1240.1 and 1965.63 for SVA (medians 1162 and 1909).
Fig. 6. Population genetic analyses
(A) Density of average divergence within 1-Mbp windows between human (GRCh38) and gorGor3, Susie3, or chimpanzee (panTro4) autosomes. (B) A comparison of human–gorGor3 and human–Susie3 divergence over 1-Mbp windows. The x axis is Alu coverage in each window, and the y axis is the difference in human–gorilla divergence between gorGor3 and Susie3. Positive y axis values indicate increased human–Susie3 divergence relative to human–gorGor3. The increased divergence of human–gorGor3 correlates with Alu content (slope, −0.0044094; intercept, 0.0001486; Pearson’s correlation, −0.60). (C) The effective population size (Ne) shown over time. A PSMC model was applied to the western lowland gorilla based on different genome assemblies. Illumina genome sequence data from western lowland gorillas (Abe, Amani, Coco, Tzambo) were mapped against gorGor3 (green) and Susie3 (orange), and PSMC was fit to the genome alignments (-N25 -t15 -r5 -b -p "4+25*2+4+6"; mutation rate = 1.25 × 10⁻⁸; generation time = 19 years). There are 100 bootstrap replicates for each gorilla and model. (D) The distribution of the bootstrap intervals that overlap 50 ka and 5 Ma. At 50 ka, Susie3 estimates of the effective population size are significantly higher than those for gorGor3; the inverse pattern is true for 5 Ma. All differences between Susie3 and gorGor3 are significant (***P ≤ 0.0001; Welch two-sample t test).
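Panel (D) compares bootstrap estimates with a Welch two-sample t test, which does not assume equal variances in the two groups. The sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom using only the standard library. The function name and the sample values are hypothetical and are not the paper's bootstrap data; obtaining a P value would additionally require a Student's t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical toy samples standing in for two sets of bootstrap estimates.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

The sketch assumes at least one sample has nonzero variance; two constant samples would cause a division by zero.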
