Genome Res. 2014 Mar;24(3):365-76.
doi: 10.1101/gr.164749.113. Epub 2013 Dec 17.

A-to-I RNA Editing Occurs at Over a Hundred Million Genomic Sites, Located in a Majority of Human Genes

Free PMC article

Lily Bazak et al. Genome Res.

Abstract

RNA molecules transmit the information encoded in the genome and generally reflect its content. Adenosine-to-inosine (A-to-I) RNA editing by ADAR proteins converts a genomically encoded adenosine into inosine. It is known that most RNA editing in humans takes place in the primate-specific Alu sequences, but the extent of this phenomenon and its effect on transcriptome diversity are not yet clear. Here, we analyzed large-scale RNA-seq data and detected ∼1.6 million editing sites. As detection sensitivity increases with sequencing coverage, we performed ultradeep sequencing of selected Alu sequences and showed that the scope of editing is much larger than anticipated. We found that virtually all adenosines within Alu repeats that form double-stranded RNA undergo A-to-I editing, although most sites exhibit editing at only low levels (<1%). Moreover, using high coverage sequencing, we observed editing of transcripts resulting from residual antisense expression, doubling the number of edited sites in the human genome. Based on bioinformatic analyses and deep targeted sequencing, we estimate that there are over 100 million human Alu RNA editing sites, located in the majority of human genes. These findings set the stage for exploring how this primate-specific massive diversification of the transcriptome is utilized.

Figures

Figure 1.
Detection of A-to-I editing in Alu repeats. (A) Multiple alignment of reads to the reference genome reveals sites of A-to-I editing (red), as well as genomic polymorphisms and sequencing errors (yellow). Detection sensitivity is improved upon examining clusters of mismatches rather than looking at each site independently. Yet, at low coverage, many bona fide editing sites either do not show any AG mismatch, or show a weak signal indistinguishable from sequencing errors. The sites detected include the few strongly edited sites and a random sample of the weaker sites. (B) Ultradeep coverage enables the full scope of editing to be revealed, showing all sites that support editing, typically at very low levels (<1%).
Figure 2.
Mismatch distributions along the detection pipeline. (A) Even a simple count of all mismatches in high-quality base pairs of sequencing reads data of Alu repeats shows a significant enrichment of editing-derived mismatch types (AG and TC). (B) Applying a strict statistical model to filter out probable sequencing errors further increases the fraction of AG/TC mismatches, but results in the loss of most of the estimated true editing signal as well. (C) In this study, we focused on the full Alu repeats rather than single genomic sites. This improves the statistical power, with only a minor reduction in the signal. As a result, we found that virtually all Alu repeats are dominated by AG/TC mismatches. (D–F) The same pipeline applied to mismatches located in the common L1 retroelement. Clearly, the strong propensity for A-to-I RNA editing is unique to the Alu repeat. However, some enrichment of AG/TC mismatches is nevertheless observed, attesting to some editing activity in the L1 repeats.
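The per-repeat view described in panel C can be illustrated with a minimal sketch. It assumes one simple reading of "dominated by AG/TC mismatches", namely that more than half of the mismatches observed in a repeat are of type AG or TC. The function names and the 0.5 threshold are illustrative assumptions, not the statistical model used in the study.

```python
from typing import Iterable, Tuple

# Mismatch types expected from A-to-I editing (reference base, read base).
# TC is the same event seen on the opposite strand.
EDITING_TYPES = {("A", "G"), ("T", "C")}


def editing_mismatch_share(mismatches: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of mismatches in one repeat that are AG or TC."""
    pairs = [(ref.upper(), alt.upper()) for ref, alt in mismatches]
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in EDITING_TYPES)
    return hits / len(pairs)


def is_editing_dominated(
    mismatches: Iterable[Tuple[str, str]], threshold: float = 0.5
) -> bool:
    """Hypothetical criterion: AG/TC share strictly above `threshold`."""
    return editing_mismatch_share(mismatches) > threshold
```

Pooling all mismatches of a repeat before deciding is what gives the per-repeat approach more statistical power than testing each site on its own.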
Figure 3.
Distribution of downstream (A) and upstream (B) nucleotides for editing sites detected in the HBM data sets. Edited sites are split into three groups according to their editing level: low level ≤10%, high level ≥40%, and medium level >10% and <40%. A clear signature of the ADAR sequence preference is observed (low G upstream of the site, and some enrichment downstream from the site). The preference is stronger at sites with high editing levels.
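The three editing-level groups in this caption can be written out as a small sketch. The thresholds (≤10%, ≥40%, and the range in between) come from the caption; the function name and the string labels are illustrative assumptions.

```python
def editing_level_group(level_percent: float) -> str:
    """Assign a site to the low / medium / high group by editing level (%)."""
    if not 0.0 <= level_percent <= 100.0:
        raise ValueError("editing level must be a percentage between 0 and 100")
    if level_percent <= 10.0:
        return "low"      # low level: <= 10%
    if level_percent >= 40.0:
        return "high"     # high level: >= 40%
    return "medium"       # medium level: > 10% and < 40%
```

Note that both boundary values are inclusive for the outer groups, so a site edited at exactly 10% is "low" and one at exactly 40% is "high".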
Figure 4.
Distribution of editing events along the consensus for the eight most edited Alu subfamilies (UCSC Genome Browser annotation). The number of edited Alu repeats of each family is given. Clearly, there are hotspots for editing in each of the families.
Figure 5.
Average editing levels per tissue in HBM data. For each tissue, the total mismatches (before filtering) are grouped for each of the four bases and presented according to the mismatch type. At A (T) reference positions, a single mismatch type dominates (G or C, respectively), whereas at C and G positions the picture is very different, with fewer mismatches (note the different scale) and a more even distribution. (A) A reference positions with non-A reads, per tissue. (B) T reference positions with non-T reads, per tissue. (C) C reference positions with non-C reads, per tissue. (D) G reference positions with non-G reads, per tissue.
Figure 6.
Editing detection is sensitive to sequencing coverage. (A) The average number of adenosines in an Alu repeat showing evidence for editing increases with the available coverage (number of reads supporting the examined nucleotide), with no sign of saturation (HBM data). By contrast, the number of mismatch sites of types other than AG/TC saturates at a relatively low coverage (after applying the statistical model to filter sequencing errors). As the typical coverage in RNA-seq is much lower than 1000 reads, this suggests that previous counts of editing sites are gross underestimates. (B) Fraction of Alu repeats showing evidence of editing (i.e., dominated by AG/TC mismatches). Again, strong dependence on coverage is observed, and atypically high coverage is required for detection in most of the Alu repeats. Our ultradeep MiSeq experiment reached saturation with all Alu repeats detected at a coverage of 1000 reads (coverage is defined as the median read coverage for the adenosines and thymines in the given Alu repeat). Based on these calculations, we estimate the total number of A-to-I editing sites in the human genome to exceed 100 million. (C) Number of different transcript variants per Alu, as a function of read coverage. No saturation is observed even for ultrahigh coverage.
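The coverage definition in panel B (median read coverage over the adenosines and thymines of a repeat) can be sketched as follows. The function name and the input layout, a reference sequence paired with per-position read depths, are assumptions made for illustration.

```python
from statistics import median
from typing import Sequence


def alu_coverage(reference: str, depth: Sequence[int]) -> float:
    """Median read depth over the A and T positions of one Alu repeat.

    `reference` is the repeat's genomic sequence and `depth[i]` is the
    number of reads covering position i (hypothetical input layout).
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    at_depths = [d for base, d in zip(reference.upper(), depth) if base in "AT"]
    if not at_depths:
        raise ValueError("repeat contains no A or T positions")
    return median(at_depths)
```

For example, `alu_coverage("ACGTA", [10, 500, 500, 30, 20])` considers only the depths 10, 30, and 20 and returns 20. Restricting the median to A and T positions keeps the measure tied to the sites where an AG or TC mismatch could appear.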
Figure 7.
Mismatch fraction distribution. Even before applying any statistical filters or analysis, a marked distinction is evident between AG/TC mismatches and other types of mismatches, provided there is sufficiently deep coverage. Presented are the distributions of the mismatch fractions (percent of reads that exhibit the mismatch among all reads supporting the site) for all (high quality, Q ≥ 30) mismatches seen in our MiSeq experiment at sites with high coverage (≥5000 reads, allowing for an accurate assessment of the mismatch fraction). Most mismatches are likely to result from sequencing errors and occur at fractions <0.1%, consistent with the sequencing quality. The AG/TC mismatches span a different range of mismatch fractions, where the bulk of the distribution lies in the range 0.1%–1%, but some sites are edited with stronger efficiencies, up to those showing close to 100% editing in a few sites. This separation of scales allows identification of editing sites, provided an accurate assessment of the mismatch fraction (requiring ultradeep coverage) is available. MM, mismatch. The y-axis shows the normalized probability density P(−log[MM fraction]).
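The separation of scales described in this caption can be illustrated with a short sketch. The mismatch fraction follows the caption's definition (percent of reads exhibiting the mismatch among all reads supporting the site), and the default cutoffs reuse the caption's figures (coverage ≥5000 reads, sequencing errors mostly below 0.1%). The function names and the simple cutoff rule are assumptions for illustration, not the authors' detection procedure.

```python
def mismatch_fraction(mismatch_reads: int, total_reads: int) -> float:
    """Percent of reads at a site that carry the mismatch."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mismatch_reads <= total_reads:
        raise ValueError("mismatch_reads must lie between 0 and total_reads")
    return 100.0 * mismatch_reads / total_reads


def is_candidate_editing_site(
    ref: str,
    alt: str,
    mismatch_reads: int,
    total_reads: int,
    min_coverage: int = 5000,            # caption: high-coverage sites only
    error_ceiling_percent: float = 0.1,  # caption: errors mostly below 0.1%
) -> bool:
    """Hypothetical rule: AG/TC mismatch above the error range at deep coverage."""
    if (ref.upper(), alt.upper()) not in {("A", "G"), ("T", "C")}:
        return False
    if total_reads < min_coverage:
        return False
    return mismatch_fraction(mismatch_reads, total_reads) >= error_ceiling_percent
```

The coverage requirement matters because a fraction near 0.1% cannot be estimated reliably from a few hundred reads, which is why the caption ties this separation to ultradeep sequencing.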
Figure 8.
Mismatch distribution along the reads. (A) AG/TC sites are evenly distributed along the reads and are even slightly depleted toward the read ends, as the alignments are more sensitive to mismatches in this region. (B) Other types of mismatches (GA/CT) show a pronounced increase toward the read ends, suggesting that many of these mismatches, despite trimming, could be attributed to alignment artifacts. Reads are 75 bp long.

Cited by 168 articles