Cell. 2019 Jun 27;178(1):91-106.e23.
doi: 10.1016/j.cell.2019.04.046. Epub 2019 Jun 6.

A Deep Neural Network for Predicting and Engineering Alternative Polyadenylation

Nicholas Bogard et al. Cell.

Abstract

Alternative polyadenylation (APA) is a major driver of transcriptome diversity in human cells. Here, we use deep learning to predict APA from DNA sequence alone. We trained our model (APARENT, APA REgression NeT) on isoform expression data from over 3 million APA reporters. APARENT's predictions are highly accurate when tasked with inferring APA in synthetic and human 3'UTRs. Visualizing features learned across all network layers reveals that APARENT recognizes sequence motifs known to recruit APA regulators, discovers previously unknown sequence determinants of 3' end processing, and integrates these features into a comprehensive, interpretable, cis-regulatory code. We apply APARENT to forward engineer functional polyadenylation signals with precisely defined cleavage position and isoform usage and validate predictions experimentally. Finally, we use APARENT to quantify the impact of genetic variants on APA. Our approach detects pathogenic variants in a wide range of disease contexts, expanding our understanding of the genetic origins of disease.

Keywords: MPRA; SNV; alternative polyadenylation; cis-regulation; deep learning; generative model; mRNA processing; machine learning; massively parallel reporter assay; single nucleotide variant; synthetic biology.

Conflict of interest statement

Declaration of Interests

The authors declare no competing interests.

Figures

Figure 1. Massively Parallel Reporter Assay for Alternative Polyadenylation
(A) APA and the polyA signal. Newly transcribed mRNA is targeted by multiple factors (grey) that enhance or suppress selection of APA sites. A PAS (below) is defined by the 6-base CSE and regions of approximately 50 bp both upstream and downstream. (B) Massively parallel reporter assay for APA. Millions of unique reporters are cloned from degenerate oligos and transiently transfected in human cell culture where they are expressed and alternatively polyadenylated. RNA is extracted, sequenced and quantified for every reporter, and the data are used to train a deep neural network. (C) The library is composed of multiple sublibraries that vary in structure and 3’UTR context (human italicized; CSE sequence denoted in legend; thick black bar = plasmid, thin = native 3’UTR sequence). Degenerate sequence was introduced upstream and/or downstream (N20-45) of the proximal PAS, and in some cases, within the CSE (degenerate, red). The CSE sequence is either the canonical AATAAA (green), the canonical with a single base substitution (yellow), or varied (red; SNHG6 = NNTAAA, Alien1 = AWTAAA, Alien2 = (95% A, 2% G/C, 3% T)). (D) The distribution of relative proximal site usage per unique sequence for each library. Each histogram is color-coded to match the proximal CSE. See also Figure S1.
Figure 2. Model Architecture, Performance and Layer-by-Layer Feature Analysis
(A) APARENT takes a one-hot-encoded PAS sequence as input to predict % proximal isoform and % cleavage at each position. (B) Predicted vs. observed proximal isoform log odds of the test set. (C) Cross-library confusion matrix when predicting proximal isoform use. Diagonal entries are tests on the training library. Off-diagonal entries are models trained on one library but tested on another. (D) Predicted vs. observed proximal use on held-out random libraries (HSPE1, SNHG6, and WHAMMP2; mean R2 = .58), and on the held-out native human PAS library (HUMAN; R2 = .69). (E) RBP motif logos and per-position effects (Pearson’s r between filter activation and proximal use) learned in the first and second convolutional layers. Layer 2 filters are shown with their proposed effector interactions. Additional RBP motifs of Layer 1 and 2 are shown in Figure S2E–F. (F) Left: Sequence logos for the CSE detector filter. Right: Ranked CSE variants extracted from the filter. Below: Comparison of variant scores against previously published data. See also Figure S2 and Tables S1 and S2.
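As a minimal sketch of the input and target representations named in panels (A) and (B), the snippet below one-hot encodes a DNA string and converts a proximal isoform proportion to log odds. The A, C, G, T channel order, the clipping constant `eps`, and the function names are assumptions made for illustration; they are not taken from the APARENT code.

```python
import math

BASES = "ACGT"  # assumed channel order, not confirmed by the source


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1.0 if base == b else 0.0 for b in BASES])
    return rows


def log_odds(p, eps=1e-6):
    """Convert a proportion in [0, 1] to log odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


# The canonical CSE hexamer from Figure 1 as a toy input.
encoded = one_hot("AATAAA")
```

A proportion of 0.5 maps to log odds of 0, so positive values indicate preferential proximal site use under this convention.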
Figure 3. Cutsite Predictions and Regulatory Determinants Learned by APARENT
(A) APARENT’s output layer predicts the probability of cleavage at each of the 186 input nucleotides. (B) Average predicted cut position on the Alien1 test set. X-axis denotes average position. The Y-axis sorts each sequence on observed cut position. (C) Predicted isoform abundance using the DNN of Figure 2A vs. integration of cleavage distribution using the DNN of Figure 3A. (D) Example Alien1 sequences with their predicted (blue) and observed (red) cut distribution. The panel also shows the average Alien1 cut distribution. (E) Selection of layer 1 filters validating known and newly discovered cleavage determinants. Each plot measures correlation between a filter activating at a certain position and cleavage occurring at some other position. See Figure S3B for additional identified motifs. (F) % sequences from the Alien1 library with significantly folded USEs (black) or CSE-Cut regions (red), as a function of cut position relative to the CSE. “USE” refers to the 50 nt region upstream of the CSE, and “CSE-Cut” refers to the region from the end of the CSE to the cut position. (G) Predicted vs. observed cut magnitude at positions near (left) or far from (right) the CSE. The color encodes DSE MFE. See also Figure S3.
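Panels (B) and (C) rely on two summaries of a per-position cleavage distribution: its integral over a window (an isoform proportion) and its mean cut position. The sketch below shows both on a toy vector. Treating the final entry as the non-proximal (distal) mass, the half-open window convention, and the function names are assumptions for illustration only.

```python
def proximal_fraction(cut_probs, start, end):
    """Sum per-position cleavage probabilities over the half-open window [start, end)."""
    return sum(cut_probs[start:end])


def mean_cut_position(cut_probs, start, end):
    """Probability-weighted mean cut position within [start, end)."""
    window = cut_probs[start:end]
    total = sum(window)
    if total == 0:
        raise ValueError("no cleavage probability inside the window")
    return sum(pos * p for pos, p in zip(range(start, end), window)) / total


# Toy distribution: five proximal positions plus one trailing entry that
# stands in for all non-proximal cleavage (hypothetical layout).
toy_probs = [0.0, 0.1, 0.3, 0.2, 0.0, 0.4]
frac = proximal_fraction(toy_probs, 0, 5)
mean_pos = mean_cut_position(toy_probs, 0, 5)
```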
Figure 4. Forward-Engineering of PASs by Backpropagation
(A) A sequence PWM is iteratively optimized against a polyA objective by gradient ascent through APARENT. (B) The sequences were synthesized in an oligo array, expressed in HEK293 and measured by sequencing. (C) PWMs engineered for target isoform abundance. Shown are the target objectives, the measured percentile among human sequences, the number of PWM samples, and measured proximal use. (D) Measured isoform use per isoform objective. The ‘Native’ category displays human sequences. (E) Saturation mutagenesis of a PWM maximized for proximal use. The heatmap shows measured isoform fold change (in log-scale) as a result of each SNV. APARENT’s predicted log fold changes strongly agree with these measurements (R2 = 0.77). (F) PWMs engineered for cleavage at target positions. Shown are the target objectives, the number of PWM samples, measured/predicted cleavage profiles and predicted MFE structures of the DSE. (G) Average cut profile of all synthesized sequences separated by objective cut position. (H) Saturation mutagenesis of a PWM optimized for cleavage at CSE+35. The heatmap shows measured change in cleavage proportion as a result of each SNV. (I) The measured SNV effects of Figure 4H grouped by the effect on DSE hairpin folding. The Y-axis displays the measured fold change (in log-scale) of cleavage proportion at the target cut site (CSE+35) due to each SNV. See also Figure S4 and Movie S1, S2.
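The idea in panel (A), optimizing a sequence PWM by gradient ascent through a differentiable predictor, can be sketched with a toy stand-in for the network. Here the "predictor" is a hypothetical linear scorer (`toy_score`) with hand-written gradients, and each PWM row is parameterized by softmax logits; the real method backpropagates through APARENT, which this example does not reproduce.

```python
import math

BASES = "ACGT"


def softmax(logits):
    """Numerically stable softmax over one PWM row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(pwm, weights):
    """Hypothetical objective: expected per-position weight under the PWM."""
    return sum(p * w for row, wrow in zip(pwm, weights) for p, w in zip(row, wrow))


def optimize_pwm(weights, steps=200, lr=1.0):
    """Gradient ascent on softmax logits; returns the optimized PWM."""
    logits = [[0.0] * 4 for _ in weights]
    for _ in range(steps):
        for row_logits, wrow in zip(logits, weights):
            probs = softmax(row_logits)
            expected = sum(p * w for p, w in zip(probs, wrow))
            for b in range(4):
                # Derivative of the row's expected weight w.r.t. logit b.
                row_logits[b] += lr * probs[b] * (wrow[b] - expected)
    return [softmax(row) for row in logits]


# Toy weights that reward the canonical hexamer at each position.
target = "AATAAA"
weights = [[1.0 if b == t else 0.0 for b in BASES] for t in target]
pwm = optimize_pwm(weights)
consensus = "".join(BASES[row.index(max(row))] for row in pwm)
```

Parameterizing rows with softmax logits keeps every row a valid probability distribution throughout the ascent, which is one common design choice; the paper's exact parameterization is not stated in this caption.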
Figure 5. Performance of APARENT on Endogenous APA
(A) The extended model for predicting isoform ratio between adjacent APA sites of the APADB and Leslie datasets. (B) Leave-one-out Cross-validation of predicted vs. observed proximal use on the 1,000 Pooled-tissue APADB PASs with highest measured total read count. (C) ROC curves obtained from classifying preferential site usage on held-out APADB data. APARENT is compared against a DNN and a linear 6-mer model trained only on APADB. (D) Preferential site prediction ROC curves in tissues and cell types from APADB and Leslie datasets. (Top) All pairs of adjacent PASs. (Bottom) only pairs where both sites have supporting reads in the tissue. (E) Bar chart of preferential site classification scores (AUC) per tissue. (F) Predicted vs. observed mean cut position relative to the CSE of individual human PAS sequences. The observed mean cut position per PAS was estimated from the Lianoglou et al. RNA-Seq data, pooled across cell types. Only high-quality estimated PASs with a minimum pooled read count of 500 were included (n=797, R2 = .74). (G) Bar chart showing the correlation (R2) between predicted and observed mean cut position of high-quality estimated PASs with a minimum read count of 200, separated by cell type of the Lianoglou et al. data. See also Figure S5.
Figure 6. Predicting the human APA variants and the existence of complex variation
(A) MPRA used to measure the effect on APA of human genomic variants from ClinVar, HGMD, and ACMG genes. (B) Measured isoform fold changes of all variants (Y-axis). X-axis denotes PAS position. The color indicates predicted fold change (blue/red = −/+ log odds ratio). All variants: R2 = 0.64. Significant variants (p < 0.0001): R2 = 0.75. Correctly predicted direction of change in isoform abundance (increase or decrease) = 90.3% of all variants. The figure is also annotated with the total number and % correctly predicted direction of fold change for USE and DSE variants with at least a 2-fold change in isoform abundance. (C) Variants with significant measured fold change (p < 0.00001) where APARENT and a linear 6-mer model disagree on the direction of change. (D) Two complex variant examples discovered in Figure 6C. (Top) A CSTF binding site variant was measured to decrease total RNA isoform abundance 1.27-fold. (Bottom) A cryptic CSE hexamer variant was measured to increase total RNA isoform abundance 1.6-fold. See also Figure S6.
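Panel (B) summarizes variant effects as log odds ratios and reports how often the predicted direction of change matches the measured one. A minimal sketch of both quantities follows. The use of the natural logarithm, the exclusion of zero-effect variants from the direction count, and all numbers in the example are assumptions; none are taken from the paper's data.

```python
import math


def logit(p):
    """Log odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def log_odds_ratio(p_ref, p_var):
    """Change in proximal isoform log odds from reference to variant."""
    return logit(p_var) - logit(p_ref)


def direction_agreement(predicted, measured):
    """Fraction of variants whose predicted and measured effects share a sign."""
    pairs = [(p, m) for p, m in zip(predicted, measured) if p != 0 and m != 0]
    same = sum(1 for p, m in pairs if (p > 0) == (m > 0))
    return same / len(pairs)


# Made-up example: a variant raising proximal use from 50% to 75%.
effect = log_odds_ratio(0.5, 0.75)
agreement = direction_agreement([0.5, -0.2, 0.1], [1.0, -0.4, -0.3])
```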
Figure 7. Characterization of pathogenic variants and PASs linked to disease
(A) Measured fold changes of pathogenic variants in ClinVar and HGMD (Y-axis). X-axis denotes PAS position. Color indicates APARENT predictions. The table lists all variant identifiers. (B) Strip plot of all assayed variants categorized by clinical significance in ClinVar. The ‘ACMG’ category contains all of the unannotated variants obtained from saturation mutagenesis. The ‘Benign’ set contains 170 variants, ‘Likely benign’ contains 419 variants, ‘VUS’ contains 946 variants and ‘Pathogenic’ contains 19 variants. AUC = 0.916 for classifying the pathogenic set from the benign set based only on APARENT’s predicted log isoform fold change. (C) Saturation mutagenesis of the TP53 (Basal Cell Carcinoma) and FOXC1 (Glaucoma) PASs. The heatmaps visualize measured and predicted APA isoform fold changes and are annotated with variants from ClinVar and HGMD. Shown above the heatmaps are correlations (R2) between predicted and measured isoform log fold changes, the total number of variants with at least 2-fold changes in isoform abundance and % correctly predicted direction of fold change. (D) Saturation mutagenesis of the F2 (Thrombophilia) PAS. Also shown are the measured wildtype and variant cleavage distributions (black and red lines), and the predicted variant cleavage distribution (blue dashed line), for the pathogenic 97G>A variant (measured / predicted RNA isoform abundance increase = 1.39 / 1.31-fold). See also Figure S7.
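The AUC in panel (B) measures how well a single score separates pathogenic from benign variants. One way to compute it, sketched below, is the pairwise (Mann-Whitney) formulation: the fraction of pathogenic/benign pairs in which the pathogenic variant scores higher, with ties counted as half. Scoring variants by the absolute predicted log fold change is an assumption here, and the numbers are invented for illustration.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented predicted log fold changes for two small variant sets.
pathogenic = [abs(x) for x in (-2.0, 1.5, -0.3)]
benign = [abs(x) for x in (0.1, -0.4, 0.2)]
score = auc(pathogenic, benign)
```

The pairwise form is quadratic in the number of variants, which is acceptable for sets of the size listed in the caption; a rank-based computation would scale better for larger sets.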
