Comparative Study
Nature. 2011 Aug 28;477(7365):419-23.
doi: 10.1038/nature10414.

Multiple Reference Genomes and Transcriptomes for Arabidopsis thaliana

Free PMC article
Xiangchao Gan et al. Nature.


Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.

Conflict of interest statement

The authors declare no competing financial interests.


Figure 1. Assembly and variation of 18 genomes of A. thaliana
a, Classification of sequence, SNPs and indels based on the Col-0 genome. b, Assembly accuracy (y axis; base substitution errors per 10 kb) measured relative to four validation data sets at each of eight stages in the IMR/DENOM assembly pipeline (x axis). Bur-0 survey (blue line): 1,442 survey sequences (about 417 bp each) in predominantly genic regions; Bur-0 divergent (red line): 188 sequences (each about 254 bp) highly divergent from Col-0 (ref. 3); Ler-0 nonrepetitive (orange line): a predominantly single-copy 175-kb Ler-0 sequence on chromosome 5; Ler-0 repetitive (purple line): a highly repetitive 339-kb Ler-0 locus on chromosome 3 (ref. 18; Supplementary Information section 4). Iter, iteration. c, Genome-wide distribution of the minimum clade size for all pairs of accessions (excluding Po-0). Each pair is represented by a grey line, the mean over all pairs by the black line and the random distribution by the green line. d, Decay in linkage disequilibrium with distance (Po-0 excluded). The black line shows r² between SNPs; the red line shows phylogenetic r² (Supplementary Information section 6).
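
As a rough illustration of the SNP-based linkage disequilibrium statistic in panel d, the sketch below computes the standard squared correlation between two biallelic SNPs. It assumes each accession is a homozygous inbred line contributing one haplotype, coded 0 or 1. The function name `ld_r2` and the toy genotype vectors are hypothetical, and the phylogenetic variant of the statistic described in the Supplementary Information is not reproduced here.

```python
from typing import Sequence


def ld_r2(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared correlation between two biallelic SNPs.

    a and b hold one allele code (0 or 1) per accession. Accessions are
    assumed to be homozygous, so each one contributes a single haplotype.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty vectors of equal length")
    n = len(a)
    p_a = sum(a) / n                               # frequency of allele 1 at SNP A
    p_b = sum(b) / n                               # frequency of allele 1 at SNP B
    p_ab = sum(x * y for x, y in zip(a, b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b                           # disequilibrium coefficient D
    return d * d / denom


# Toy data (hypothetical): four accessions, two SNPs.
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # identical patterns
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent patterns
```

Averaging such values over SNP pairs binned by physical distance would give a decay curve of the kind the caption describes.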
Figure 2. Transcript and protein variation
a, Example of a splice site change between two haplotypes for the gene AT1G64970. Haplotype I (Col-0) is spliced with an intron 6 bp (two amino acids) shorter than haplotype II (Ler-0); Po-0 (heterozygous) shows allele-specific expression of both. b, Re-annotation of the FRIGIDA locus showing annotations for accessions Sf-2 (functional), and Col-0 (truncated by a premature stop) and Ler-0 (non-functional) (Supplementary Figs 18 and 42). Right: the 19 accessions are shown clustered on the basis of the AA distance between their FRIGIDA amino-acid sequences. Common isoform clusters (at distance 2% or less; red line) are shown, leading to three clusters with three, seven and nine accessions. c, Proteome diversity for coding genes, pseudogenes and A. lyrata genes (top) and for genes with disruptions (bottom). Reported is the fraction of genes with relative AA distance to other accessions (average over pairs) in the given colour-coded interval (Supplementary Information section 10.7). d, Frequency of isoforms of coding genes and pseudogenes (top), and those associated with different disruptions (bottom).
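
The isoform grouping in panel b (clusters at AA distance 2% or less) can be sketched as threshold-based clustering. This is only a minimal illustration under stated assumptions: the distance used here is the fraction of mismatched positions between equal-length aligned sequences, which stands in for the relative AA distance defined in Supplementary Information section 10.7, and the single-linkage rule is an assumption, since the caption does not name the linkage method. The names `aa_distance` and `cluster_isoforms` and the toy sequences are hypothetical.

```python
from itertools import combinations


def aa_distance(s: str, t: str) -> float:
    """Fraction of mismatched positions between two aligned sequences.

    Stand-in for the paper's relative AA distance (an assumption).
    """
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x != y for x, y in zip(s, t)) / len(s)


def cluster_isoforms(seqs: dict[str, str], threshold: float = 0.02) -> list[set[str]]:
    """Single-linkage clusters: join any two sequences within the threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if aa_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=len, reverse=True)


# Toy sequences (hypothetical), 100 residues each.
toy = {
    "acc1": "A" * 100,
    "acc2": "A" * 99 + "G",        # 1% away from acc1
    "acc3": "G" * 50 + "A" * 50,   # 50% away from acc1
}
print(cluster_isoforms(toy))
```

With the toy input, the first two sequences fall into one cluster and the third stays on its own.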
Figure 3. Quantitative variation of coding gene expression
a, The overlap between heritable (more than 30%) and differentially expressed (FDR 5%) genes, and genes with a cis-eQTL (FDR 5%). b, Differentially expressed genes and genes with cis-eQTLs (FDR 5%) categorized by fold change. Nucleotide variants (orange bars; 647 cis-eQTLs) are SNPs and single-base indels; copy-number variants (green bars; 42 cis-eQTLs) are regions with elevated coverage in aligned genomic reads in at least one accession; gene structural variants (black bars; 227 cis-eQTLs) are accession-specific deletions, insertions or changes to the gene model. c, The spatial distribution of nucleotide-variant eQTLs relative to the start of protein-coding genes (FDR 5%, overlapping genes removed; n = 647). The line shows density of gene length. d, Frequencies of nucleotide-variant eQTLs in protein-coding genes, classified by component (bar widths are proportional to the components’ average physical lengths): red bars, upstream; yellow bars, 5′ untranslated region; green bars, coding sequence exons; blue bars, introns; cyan bars, 3′ untranslated region; grey bars, downstream.
Figure 4. Protein diversity and gene expression vary by gene category or family
The numbers next to each row are gene counts. The gene families were selected from Supplementary Figs 26 and 39–41 to represent the breadth of observed variation. a, Distribution of average AA distances to other accessions (compare with Fig. 2c). b, Fraction of unexpressed, expressed and differentially expressed genes (expressed is a superset of differentially expressed). c, Distribution of genes categorized by fold change (between lowest and highest across 19 accessions). d, Distribution of the numbers of accessions contributing to differential expression. TF, transcription factor; CC, coiled-coil; TIR, Toll interleukin-1 receptor; NB-LRR, nucleotide-binding leucine-rich repeat.
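
Panel c categorizes genes by fold change between the lowest and highest expression across accessions. A minimal sketch of that calculation follows. It assumes strictly positive expression values on a linear scale, and the bin edges (2, 4, 8) and labels are hypothetical placeholders, not the categories used in the figure.

```python
from bisect import bisect_right
from typing import Sequence


def fold_change(expr: Sequence[float]) -> float:
    """Ratio of the highest to the lowest expression value across accessions."""
    lo, hi = min(expr), max(expr)
    if lo <= 0:
        raise ValueError("expression values must be strictly positive")
    return hi / lo


def fold_change_bin(fc: float,
                    edges: Sequence[float] = (2.0, 4.0, 8.0),
                    labels: Sequence[str] = ("<2", "2-4", "4-8", ">=8")) -> str:
    """Assign a fold change to a bin; edges and labels are hypothetical."""
    # bisect_right puts a value equal to an edge into the higher bin.
    return labels[bisect_right(edges, fc)]


# Toy expression values (hypothetical) for one gene in five accessions.
values = [10.0, 12.5, 30.0, 11.0, 9.0]
fc = fold_change(values)
print(fc, fold_change_bin(fc))
```

Requiring positive values sidesteps division by zero; real expression data with zeros would need a pseudocount or a filter, a choice the caption does not specify.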



    1. Johanson U, et al. Molecular analysis of FRIGIDA, a major determinant of natural variation in Arabidopsis flowering time. Science. 2000;290:344–347.
    2. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59.
    3. Ossowski S, et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 2008;18:2024–2033.
    4. Schneeberger K, et al. Reference-guided assembly of four diverse Arabidopsis thaliana genomes. Proc Natl Acad Sci USA. 2011;108:10249–10254.
    5. Weigel D, Mott R. The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 2009;10:107. doi: 10.1186/gb-2009-10-5-107.
