Comparative Study
Genome Res., 17 (5), 632-40

A Large Number of Novel Coding Small Open Reading Frames in the Intergenic Regions of the Arabidopsis thaliana Genome Are Transcribed and/or Under Purifying Selection

Kousuke Hanada et al. Genome Res.

Abstract

Large-scale cDNA sequencing projects and tiling array studies have revealed the presence of many unannotated genes. For protein coding genes, small coding sequences may not be identified by gene finders because of the conservative nature of prediction algorithms. In this study, we identified small open reading frames (sORFs) with high coding potential by a simple gene finding method (Coding Index, CI) based on the nucleotide composition bias found in most coding sequences. Applying this method to 18 Arabidopsis thaliana and 84 yeast sORF genes with evidence of expression at the protein level gives 100% accurate prediction. In the A. thaliana genome, we identified 7159 sORFs that are likely coding sequences (coding sORFs) with the CI measure at the 1% false-positive rate. To determine if these coding sORFs are parts of functional genes, we evaluated each coding sORF for evidence of transcription or evolutionary conservation. At the 5% false-positive rate, we found that 2996 coding sORFs are likely expressed in at least one experimental condition of the A. thaliana tiling array data. In addition, the evolutionary conservation of each A. thaliana sORF was examined within A. thaliana or between A. thaliana and five plants with complete or partial genome sequences. In 3997 coding sORFs with readily identifiable homologous sequences, 2376 are subject to purifying selection at the 1% false-positive rate. After eliminating coding sORFs with similarity to known transposable elements and those that are likely missing exons of known genes, the remaining 3241 coding sORFs with either evidence of transcription or purifying selection likely belong to novel coding genes in the A. thaliana genome.
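As an illustration of the first step, the sketch below enumerates candidate sORFs of 90 to 300 bp in all six reading frames. The abstract does not say how an ORF is delimited, so the definition used here (an ATG followed by the first in-frame stop codon, with the stop counted in the length) and the name `find_sorfs` are assumptions made for the example, not the authors' procedure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sorfs(seq, min_len=90, max_len=300):
    """Enumerate ATG..stop ORFs whose length (stop included) is in range.

    Assumption: an ORF starts at the first ATG after the previous stop in
    a given frame. Coordinates are 0-based, end-exclusive, and refer to the
    strand that was scanned ("-" means the reverse complement).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a new candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_len <= length <= max_len:
                        hits.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # close the ORF and keep scanning
    return hits
```

Each returned candidate would then be scored for coding potential; the scoring itself (the CI measure) is not reproduced here because the abstract does not give its formula.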

Figures

Figure 1.
Analysis procedures and summary of results. The overall procedures for identifying sORFs (between 90 and 300 bp) that have qualifying Coding Index (CI) values, above-background tiling array hybridization intensities, evidence of purifying selection, and cognate cDNA/ESTs.
Figure 2.
Frequency distributions of posterior probabilities for simulated coding and noncoding sequences. (A) Distribution of posterior probability (pp) of sequences resembling noncoding sequences (NCDSs). Ten thousand random sequences were generated based on the hexamer and pentamer frequencies of intronic ORFs. The great majority of simulated sequences have very small pp, and only 5% of the pp values are >0.2239. (B) The pp distribution of sequences resembling coding sequences (CDSs). Random sequences were generated according to cDNA CDSs. Approximately 10% of the CDS-like random sequences have pp values <0.2239.
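The way a cutoff such as pp = 0.2239 follows from a simulated noncoding distribution can be sketched as an empirical quantile: choose the score that at most 5% of null sequences exceed, then check how many coding-like sequences fall at or below it. The function names and the toy scores below are hypothetical; this is a minimal sketch of the general idea, not the authors' code or data.

```python
import math


def threshold_at_fpr(null_scores, fpr=0.05):
    """Return a cutoff that at most a fraction `fpr` of null scores exceed."""
    if not null_scores or not 0 <= fpr < 1:
        raise ValueError("need at least one null score and 0 <= fpr < 1")
    ordered = sorted(null_scores)
    n = len(ordered)
    # Number of null scores allowed above the cutoff (small epsilon guards
    # against floating-point error in fpr * n).
    allowed = min(math.floor(fpr * n + 1e-9), n - 1)
    return ordered[n - allowed - 1]


def miss_rate(coding_scores, threshold):
    """Fraction of coding-like scores that do not exceed the cutoff."""
    return sum(s <= threshold for s in coding_scores) / len(coding_scores)


# Toy example with made-up scores (not data from the paper).
null = [i / 100 for i in range(100)]      # 0.00, 0.01, ..., 0.99
cutoff = threshold_at_fpr(null, fpr=0.05)
exceeding = sum(s > cutoff for s in null)  # null scores called "coding"
```

With real simulations, `null_scores` would hold the pp values of the NCDS-like random sequences and `coding_scores` those of the CDS-like ones.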
Figure 3.
Sliding window calculation of pp in genomic sequences surrounding IDA. The pp values were determined in 75-bp windows with 3-bp steps for A. thaliana chromosome sequences. The pp values in a region containing the small protein gene IDA and flanking sequences are shown. The diagram on top indicates the locations of exons (white box, untranslated regions; black box, CDS), introns (bent lines), transcriptional starts (small arrows), and intergenic sequences (thick gray lines). The six plots below the annotation diagram are the results of pp calculations in six reading frames (forward, +; reverse, −). The dotted line indicates pp = 0.2239, the threshold value for calling whether a 75-bp window is likely a CDS or not. The shaded areas highlight the overlap between IDA CDS and regions with a high pp. The arrow indicates the correct frame for the IDA CDS.
Figure 4.
Distributions of CI values of CDSs and NCDSs. The CI value distributions are shown as box plots with the solid horizontal line indicating the median CI value, the box representing the interquartile range (25%–75%), and the dotted line indicating the first to the 99th percentile. CDS refers to the exon coding sequences derived from full-length cDNAs. sORFs of NCDSs are obtained from two types of sequences: (1) annotated intergenic regions and (2) intron sequences derived from full-length cDNAs.
Figure 5.
Distributions of hybridization intensity values for probes in introns, coding sORFs, and exons. The distribution of intensity values from the 7-d-old seedling tiling array expression data for probes in exons (A), introns (B), coding sORFs (C), tRNA genes (D), and rRNA genes (E). The x-axis and y-axis indicate log10-transformed expression intensity values and the frequency of probes in each intensity bin, respectively.
