Genome Biol. 2022 Nov 5;23(1):232.
doi: 10.1186/s13059-022-02799-4.

Deciphering the impact of genetic variation on human polyadenylation using APARENT2

Johannes Linder et al. Genome Biol.

Abstract

Background: 3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging.

Results: We introduce a residual neural network model, APARENT2, that can infer 3'-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2's performance on several variant datasets, including functional reporter data and human 3' aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3' untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of 43.8 million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3'-end and autism spectrum disorder. To experimentally validate APARENT2's predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells.

Conclusions: A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3'-end mutations and human health.

Keywords: Deep learning; Explainable AI; Genomics; Neural networks; Polyadenylation; RNA; Untranslated region; Variant interpretation.
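
The in silico saturation mutagenesis mentioned in the abstract can be illustrated with a minimal sketch: every position of a sequence is substituted with each of the three alternative bases, and each mutant is scored against the reference. The scoring function below (`toy_score`) is a hypothetical stand-in that only checks for the canonical AATAAA hexamer; it is not APARENT2, and the function names are assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str, str], float]:
    """Score every single-nucleotide substitution of `seq` against the reference.

    Returns a dict keyed by (position, ref_base, alt_base) whose value is
    score(mutant) - score(reference).
    """
    ref_score = score(seq)
    effects = {}
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, ref, alt)] = score(mutant) - ref_score
    return effects


def toy_score(seq: str) -> float:
    """Hypothetical scorer (NOT APARENT2): 1.0 if AATAAA is present, else 0.0."""
    return 1.0 if "AATAAA" in seq else 0.0


# A 10-nt toy sequence yields 10 * 3 = 30 single-nucleotide variants.
effects = saturation_mutagenesis("GCAATAAAGC", toy_score)
```

In practice the toy scorer would be replaced by a trained sequence-to-function model, and the resulting per-variant effects could then be compared with observed variation.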

Conflict of interest statement

A.K. is a scientific co-founder of Ravel Biotechnology Inc.; is on the SAB of PatchBio Inc., SerImmune Inc., AINovo Inc., TensorBio Inc., and OpenTargets; is a consultant with Illumina Inc.; and owns shares in DeepGenomics Inc., Immuni Inc. and Freenome Inc. G.S. is a co-founder of Parse Biosciences and is on the SAB of Modulus Therapeutics.

Figures

Fig. 1
A deep residual neural network for predicting polyadenylation. A Core processing elements, auxiliary RBPs, and other determinants influence polyadenylation signal affinity. B Illustration of tandem 3′ UTR alternative polyadenylation (APA) in pre-mRNA. C Residual neural network architecture. A one-hot coded representation of the PAS is used to predict the 3′ cleavage distribution. D Predicted vs measured proximal isoform log odds of native human 3′ UTR PASs measured in an MPRA (n = 1085). E Predicted logit score of all human PASs as a function of PAS # relative to the distal-most PAS. F Masked softmax regression (or an LSTM) for predicting multi-PAS isoform proportions given APARENT2 and Saluki scores as input. G Left: Comparison of correlation between predicted and measured distal isoform proportions from tissue-pooled native data (20-fold cross-validation). Each model predicts logit scores which are used to fit a multi-PAS regressor. LSTM performance shown as shaded bars. Right: Improvement in Spearman r when using Saluki scores in addition to APARENT2 as input; the improvement is shown separately for genes where the maximum distance between any adjacent pair of PASs is ≤250 bp and >250 bp, respectively
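
The masking step behind a masked softmax over a variable number of PASs (panel F) can be sketched as follows. This is an illustration only, assuming per-PAS logit scores padded to a fixed length and a 0/1 mask marking real PASs; the learned regression weights on the APARENT2 and Saluki scores are not reproduced here, and the function name is hypothetical.

```python
import math
from typing import List


def masked_softmax(logits: List[float], mask: List[int]) -> List[float]:
    """Softmax restricted to positions where mask == 1.

    Padded (masked) positions receive an isoform proportion of exactly 0.
    """
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    active = [x for x, m in zip(logits, mask) if m]
    if not active:
        raise ValueError("at least one PAS must be unmasked")
    peak = max(active)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy gene with 3 PASs, padded to length 5; the last two slots are padding
# and their logit values are ignored.
props = masked_softmax([0.0, 0.0, math.log(2.0), 9.9, -3.0], [1, 1, 1, 0, 0])
```

With these toy logits, the third PAS has twice the odds weight of each of the first two, so the proportions come out close to 0.25, 0.25, and 0.5, with zeros in the padded slots.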
Fig. 2
Prediction of functionally screened polyadenylation variants. A Variant of uncertain significance from ClinVar (rs886052699) measured in an MPRA [36]. Shown are the measured and predicted 3′ cleavage distributions across the PAS. Green: wildtype cleavage, magenta: variant cleavage. B Comparison of precision-recall curves when tasking each APA model with classifying disruptive APA variants (|fold change|>2) from the MPRA of Bogard et al. [36] (n = 12,350). The curves are shown for non-CSE variants only. C Comparison of predicted vs measured RNA/DNA log fold change ratios on the data from Slutskin et al. [35] (n = 442). D Comparison of predicted vs measured RNA/DNA log fold change ratios at individual cleavage sites within a given PAS
Fig. 3
Interpretation of cis-acting polyadenylation variants. A Mask-based variant interpretation, reconstructing the relative odds ratio between the wildtype and mutated sequence. B Interpretation of a ClinVar SNV in the LDHA PAS (rs886048091). Left boxplot: Measured LORs of TGTA-creating variants from the MPRA of Bogard et al. [36]. Right boxplot: Measured LORs of non-TGTA-creating variants. p-values are computed with two-sided t-tests. C Interpretation of two variants of interest in the MOCS2 PAS and BMPR1A PAS. D Individual- and pairwise TGTA motifs were inserted in wildtype PASs and their LORs were measured in an MPRA [36]. E Predicted and observed LOR of individual TGTA insertions. F Predicted and observed LOR of dual TGTA insertions
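
The log odds ratio (LOR) used throughout these captions can be illustrated with a short sketch. It assumes the LOR is the natural-log odds of the variant isoform proportion minus that of the wildtype; the sign convention, the log base, and the function names are assumptions for illustration rather than details confirmed by the source.

```python
import math


def logit(p: float) -> float:
    """Log odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def log_odds_ratio(p_wildtype: float, p_variant: float) -> float:
    """Variant log odds minus wildtype log odds (assumed sign convention)."""
    return logit(p_variant) - logit(p_wildtype)


# Toy example: proximal isoform usage drops from 50% to 20%.
lor = log_odds_ratio(0.5, 0.2)
```

Under this convention a negative LOR indicates that the variant reduces usage of the PAS, and a positive LOR indicates a gain.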
Fig. 4
Redundancy of functional hexamer motifs in human polyadenylation signals. A Position weight matrix (PWM) of the CSE motif, as measured in the MPRA of Bogard et al. [36]. B Predicted vs measured log odds ratio of CSE mutations from the MPRA (n = 628). Right: Log odds ratio predicted by APARENT2 vs the effect sizes predicted by a linear hexamer model trained on the same data. C Interpretation of a functionally silent CSE mutation in the TPMT gene. D Interpretation of a variant with dampened effect size in the SMAD4 gene. E Boxplot showing measured LORs of all assayed CSE mutations [36]. p-values are computed with two-sided t-tests
Fig. 5
Inferring 3′ aQTL effect sizes from sequence. A Total number of 3′ aQTLs, cis-acting aQTLs, and lead aQTLs respectively (GTEx v7). Right: Predicted vs measured aQTL effect sizes in the lung. B Predicted vs measured 3′ aQTL effect size Spearman r’s (GTEx v7). Each dot corresponds to the correlation in a particular tissue type. C Predicted vs measured aQTL effect sizes of the data from Mittleman et al. [21] (n=58). D Multiple softmax regression for predicting tissue-specific isoform abundance. APARENT2 (green) and the tissue model (blue) are used to score each PAS. E Increase (red) or decrease (blue) in Spearman r when using a particular tissue model to scale the 3′ aQTL predictions made by APARENT2 in a given GTEx tissue (testis, ovary, B-cell lymphocytes, and brain). F Reconstructive mask for a SNP in the ALDH7A1 gene, with a brain-specific effect. The bottom mask is the result of 64 randomly initialized optimization attempts. Boxplot shows differential PAS usage in data from Lianoglou et al. [42]
Fig. 6
Extending aQTL predictions to the entire 3′ UTR. A The total impact of a 3′ UTR mutation on isoform abundance is scored by APARENT2 and Saluki. B Absolute value of predicted vs measured 3′ aQTL effect sizes for lead SNPs and a matched set of non-lead SNPs overlapping PASs in the lung (GTEx v7). Final predictions are made by isotonic regression trained on all non-lung SNPs. C Top: Predicted vs measured aQTL effect size Spearman r’s for SNPs overlapping PASs. Each dot represents the correlation in a given tissue. Bottom: Predicted vs measured effect sizes for all 3′ UTR SNPs, using the joint APARENT2+Saluki model. D Difference in ISM maps between mutant and wildtype sequence (rs540). Green/magenta annotations correspond to APARENT2/Saluki predictions
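
Panel B mentions isotonic regression as the final calibration step. The source does not state which implementation was used; the following is a generic sketch of the pool-adjacent-violators algorithm with unit weights, assuming the measured values are already sorted by the raw predicted score. The function name is hypothetical.

```python
from typing import List


def isotonic_fit(y: List[float]) -> List[float]:
    """Least-squares non-decreasing fit to `y` (pool adjacent violators).

    `y` must already be ordered by the predictor; all points have weight 1.
    """
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Toy example: the out-of-order pair (3.0, 2.0) is pooled to its mean, 2.5.
fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

The fitted step function maps raw scores to calibrated effect sizes while preserving their rank order, which is why it leaves rank-based metrics such as Spearman r largely unchanged.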
Fig. 7
Large-scale analysis of polyadenylation signal mutations and their implication in health and disease. A Relative position of mutation vs predicted Δ isoform abundance for all PAS variants in gnomAD (n = 2.8 million). Color intensity represents allele frequency. Inset: Reference vs alternate isoform abundance for all 43.8 million potential PAS SNVs (orange = gnomAD variants). B Distribution of predicted Δ isoform abundance for common gnomAD variants (AF > 0.1%; green) and singletons (magenta). C Relative enrichment of disruptive variants (Δ isoform abundance < -0.15) with respect to singleton variants. Wilcoxon p-values are shown above each bar. D Absolute predicted isoform fold change vs p-value of fine-mapped GWAS SNPs from CAUSALdb (95% credible set, n = 4200) [70]. E Distribution of predicted log odds ratios for the F2 PAS. F Distribution of predicted log odds ratios for the SCAF8 PAS. G Predicted log odds ratios among ASD cases and controls from a WGS study [51]
Fig. 8
MPRA validation in multiple cell lines. A APA reporter system for measuring variant effects in HEK293T, SK-N-SH and HMC3. B Predicted vs measured variant effects (LORs) in the three assayed cell lines. C Predicted vs measured effects of 9 GWAS SNPs (PIP = posterior inclusion probability). D Measured effects of 2 SNVs in the F2 and SCAF8 genes (orange), alongside common gnomAD SNPs (blue). E Measured effects of 76 autism variants from An et al. [51]. p-values are computed with two-sided Wilcoxon tests

References

    1. Elkon R, Ugalde AP, Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14(7):496–506.
    2. MacDonald CC, Redondo JL. Reexamining the polyadenylation signal: were we wrong about AAUAAA? Mol Cell Endocrinol. 2002;190(1–2):1–8.
    3. Tian B, Graber JH. Signals for pre-mRNA cleavage and polyadenylation. Wiley Interdiscip Rev RNA. 2012;3(3):385–96.
    4. Grozdanov PN, Masoumzadeh E, Latham MP, MacDonald CC. The structural basis of CstF-77 modulation of cleavage and polyadenylation through stimulation of CstF-64 activity. Nucleic Acids Res. 2018;46(22):12022–39.
    5. Nazim M, Masuda A, Rahman MA, Nasrin F, Takeda JI, Ohe K, Ohkawara B, Ito M, Ohno K. Competitive regulation of alternative splicing and alternative polyadenylation by hnRNP H and CstF64 determines acetylcholinesterase isoforms. Nucleic Acids Res. 2017;45(3):1455–68.
