Genome Res. 27 (5), 885-896

An Improved Assembly and Annotation of the Allohexaploid Wheat Genome Identifies Complete Families of Agronomic Genes and Provides Genomic Evidence for Chromosomal Translocations

Bernardo J Clavijo et al. Genome Res.

Abstract

Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb and has high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known genome rearrangements and identified one novel rearrangement. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.
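The scaffold N50 quoted in the abstract is the length L such that scaffolds of length L or longer together contain at least half of all assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length of the scaffold at which the running total,
    taken from longest to shortest, first reaches half the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), not taken from TGACv1.
example_lengths = [100, 80, 60, 40, 20]
example_n50 = n50(example_lengths)
```

With these made-up lengths the total is 300, and the two longest scaffolds (100 + 80) are the first to cover half of it, so the N50 is 80.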

Figures

Figure 1.
Summary of the TGACv1 wheat genome sequence assembly. (A,B) KAT spectra-cn plots comparing the PE reads to the TGACv1 scaffolds (A) and CSS scaffolds (B). Plots are colored to show how many times fixed length words (k-mers) from the reads appear in the assembly; frequency of occurrence (multiplicity; x-axis) and number of distinct k-mers (y-axis). Black represents k-mers missing from the assembly; red, k-mers that appear once in the assembly; green, twice; etc. Plots were generated using k = 31. The black distribution between k-mer multiplicity 15 and 45 in B represents k-mers that do not appear in the CSS assembly. (C) Comparison of scaffold lengths and total assembly sizes of the TGACv1, W7984, and CSS assemblies. (D) Scaffold 577042 of the TGACv1 assembly. Tracks from top to bottom: aligned BAC contigs, CSS contigs, W7984 contigs, coverage of PE reads, coverage of LMP fragments, and GC content with scaffolded gaps (N stretches) with 0% GC highlighted in green. There are two BACs (composed of seven and four contigs each), 22 CSS contigs, and 15 W7984 contigs across the single TGACv1 scaffold.
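The spectra-cn comparison described in panels A and B can be illustrated with a toy calculation: count each k-mer in the reads, look up how many times it occurs in the assembly, and tally distinct k-mers by read multiplicity and assembly copy number. The sketch below is not KAT's implementation. It counts forward-strand k-mers only (no canonical form), and the names `count_kmers`, `spectra_cn`, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_kmers(sequences, k):
    """Count every k-mer (forward strand only) across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_cn(reads, assembly, k):
    """Return {read multiplicity: Counter({assembly copy number: distinct k-mers})}."""
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    matrix = defaultdict(Counter)
    for kmer, multiplicity in read_counts.items():
        # Copy number 0 means the k-mer is in the reads but missing from the assembly.
        matrix[multiplicity][asm_counts.get(kmer, 0)] += 1
    return matrix


# Toy data: one read sequenced twice and present in the assembly,
# and one read whose k-mers are absent from the assembly.
toy_reads = ["ACGTAC", "ACGTAC", "GGGTTT"]
toy_assembly = ["ACGTAC"]
toy_matrix = spectra_cn(toy_reads, toy_assembly, k=3)
```

In this toy case the four k-mers seen twice in the reads each appear once in the assembly (the red class in the plots), while the four k-mers seen once in the reads are missing from it (the black class).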
Figure 2.
Comparative alignment of TGACv1 scaffolds with the 3B BAC-based pseudomolecule. (A,C) Dot plots between TGACv1 scaffolds and 3B show disruptions in sequence alignment, including rearrangements (red) and inversions (blue). (B,D) Graphical representation of sequence annotations in disrupted regions. Junctions in the TGACv1 scaffolds are consistent with a complete retroelement spanning the junction, with identical target site duplications (TSDs) on either side of the retroelement (asterisks). Corresponding regions in the 3B BAC-based pseudomolecule are characterized by Ns that produce inconsistent alignment of retroelements across putative junctions. Retroelements of the same family (CACTA, Sabrina) but matching distinct members in the TREP database are indicated by different colors. Numbers adjacent to sequences correspond to regions shown in panels A and C, respectively. (B) Scale bars, 10 kbp; (D) scale bars, 30 kbp.
Figure 3.
Comparison between IWGSC annotation and TGACv1 high (HC) and low confidence (LC) genes. IWGSC genes were aligned to the TGACv1 assembly (gmap, ≥90% coverage, ≥95% identity) and classified based on overlap with TGACv1 genes. (A) Identical indicates shared exon–intron structure; contained, exactly contained within the TGACv1 gene; structurally different, alternative exon–intron structure; and missing, no overlap with IWGSC. (B) Bar plot showing proportion of HC TGACv1 protein-coding genes supported by protein similarity or PacBio data. Genes are classified based on overlap with the full set of IWGSC genes.
Figure 4.
Circular representation of the TGACv1 CS42 assembly. Chromosomes, genetic bins, and genomic features are visualized on the outer rings (A–H) and interchromosomal links identify known and potentially novel translocation events. (A) The seven chromosome groups of the A, B, and D genomes, scaled by number of genetic bins (black bands). (B–H) Combined heatmap/histogram representations of genomic features per genetic bin. With the exception of D, all counts are normalized by the size of the genetic bin in Mbp, calculated as the total size of all scaffolds assigned to the bin. (B) Distribution of unique genes, i.e., genes that did not have orthologs in a genome-wide OrthoMCL screen. (C) Distribution of wheat-specific genes. (D,E) Number of HC protein-coding genes. (F) Distribution of DTC, DTM, and DTH DNA transposons (Supplemental Information S1; Supplemental Table S7.1). (G) Distribution of RLX, RLC, RLG, RXX, and RIX retrotransposons. (H) Distribution of tandem duplications. Light yellow links connect homoeologous OrthoMCL triads. Dark yellow-colored links connect genetic bins harboring OrthoMCL outlier triads (Supplemental Information S1, section S6) that identify known translocation events. Dark green links connect genetic bins harboring at least three OrthoMCL outlier triads that may support novel translocation events. The cyan link shows a novel PCR-validated translocation event between Chromosomes 5BS-4BL.
Figure 5.
Response of differentially expressed (DE) triads to stress treatments according to the number and pattern of DE homoeologs. Triads were classified as having one homoeolog DE (yellow), two homoeologs DE with same direction of change (green), three homoeologs DE with same direction of change (orange), or opposite direction of change between DE homoeologs (blue). The stresses applied were drought (D), heat (H), drought and heat combined (DH), powdery mildew (PM), and stripe rust (SR), with the duration of stress application indicated in hours (h).
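The four triad categories in this caption can be expressed as a small decision rule. The sketch below assumes a hypothetical encoding in which each homoeolog is scored +1 (DE, up), -1 (DE, down), or 0 (not DE); the function name `classify_triad` and the label strings are illustrative and do not come from the paper's analysis pipeline.

```python
def classify_triad(changes):
    """Classify a triad from three per-homoeolog calls.

    changes: sequence of three values, each +1 (DE, up), -1 (DE, down),
    or 0 (not differentially expressed). This encoding is an assumption.
    """
    if len(changes) != 3 or any(c not in (-1, 0, 1) for c in changes):
        raise ValueError("expected three values drawn from -1, 0, +1")
    de_calls = [c for c in changes if c != 0]
    if not de_calls:
        return "no homoeolog DE"
    if len(set(de_calls)) > 1:
        # At least one homoeolog up and one down.
        return "opposite direction of change"
    labels = {
        1: "one homoeolog DE",
        2: "two homoeologs DE, same direction",
        3: "three homoeologs DE, same direction",
    }
    return labels[len(de_calls)]


# Hypothetical triads, not results from the study.
example_calls = [classify_triad(t) for t in [(1, 0, 0), (1, 1, 0), (-1, -1, -1), (1, -1, 0)]]
```

Triads with no DE homoeolog fall outside the figure's four classes, so the sketch returns a separate label for them rather than forcing them into one of the colored categories.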
Figure 6.
Genes encoding the gibberellin (GA) biosynthetic and signaling pathway in bread wheat. The GA biosynthesis, inactivation, and signal transduction pathway, illustrating the representation of the gene sequences in CSS and TGACv1 assemblies. If more than one paralog is known for a gene, its number according to the classification by Pearce et al. (2015) is indicated on the left of the box. Bioactive GAs are boxed in red.
