Genome Res. 2018 Aug;28(8):1228-1242.
doi: 10.1101/gr.229401.117. Epub 2018 Jun 15.

Predicting Human Genes Susceptible to Genomic Instability Associated With Alu/Alu-mediated Rearrangements

Free PMC article

Xiaofei Song et al. Genome Res. 2018.

Abstract

Alu elements, the short interspersed element numbering more than 1 million copies per human genome, can mediate the formation of copy number variants (CNVs) between substrate pairs. These Alu/Alu-mediated rearrangements (AAMRs) can result in pathogenic variants that cause diseases. To investigate the impact of AAMR on gene variation and human health, we first characterized Alus that are involved in mediating CNVs (CNV-Alus) and observed that these Alus tend to be evolutionarily younger. We then computationally generated, with the assistance of a supercomputer, a test data set consisting of 78 million Alu pairs and predicted ∼18% of them are potentially susceptible to AAMR. We further determined the relative risk of AAMR in 12,074 OMIM genes using the count of predicted CNV-Alu pairs and experimentally validated the predictions with 89 samples selected by correlating predicted hotspots with a database of CNVs identified by clinical chromosomal microarrays (CMAs) on the genomes of approximately 54,000 subjects. We fine-mapped 47 duplications, 40 deletions, and two complex rearrangements and examined a total of 52 breakpoint junctions of simple CNVs. Overall, 94% of the candidate breakpoints were at least partially Alu mediated. We successfully predicted all (100%) of Alu pairs that mediated deletions (n = 21) and achieved an 87% positive predictive value overall when including AAMR-generated deletions and duplications. We provided a tool, AluAluCNVpredictor, for assessing AAMR hotspots and their role in human disease. These results demonstrate the utility of our predictive model and provide insights into the genomic features and molecular mechanisms underlying AAMR.

Figures

Figure 1.
Alu structure and Alu/Alu-mediated rearrangement (AAMR) event formation. (A) A consensus Alu element is depicted, with both left and right 7SL monomers indicated. A Box, B Box, and A′ Box are internal Pol III promoter elements; the linker is an A-rich sequence; and the element ends in a poly(A) tail. (B) A diagram of an AAMR event is shown: A genomic rearrangement is mediated by a substrate pair of Alu elements followed by the formation of a relatively complete chimeric Alu. Block arrows represent Alu elements on the + (forward arrow) and − (reverse arrow) strand. The 5′ CNV-Alu is colored maroon, and the 3′ CNV-Alu is pink. Ctrl-Alu elements not involved in AAMR are in blue. The microhomology generated at the breakpoint junction after the AAMR event is shown in green.
Figure 2.
Diagram of the workflow used for predicting CNV-Alu pairs and AAMR hotspot genes in this study. Approximately 1.2 million Alus are documented in the “Repeating Elements by RepeatMasker” track at the UCSC Genome Browser. CNV-Alus are those with experimental evidence supporting their role in AAMR (Supplemental Table S1), and all the others are Ctrl-Alus. We selected Alu pairs that are in the same orientation, span at least one exon, and are located <250 kb from each other. Both the individual Alu sequence features and genomic architectural features were characterized, and a subset of features was used in model training. The QDA (quadratic discriminant analysis) model achieved the highest sensitivity and was applied for predicting CNV-Alu pairs. The number of predicted CNV-Alu pairs is significantly correlated with the number of observed AAMR events for known hotspot genes. Therefore, we further determined the relative risk of AAMR in 12,074 human genes that have a MIM entry using the count of predicted CNV-Alu pairs. Finally, we experimentally validated this prediction by performing aCGH and mapping the breakpoint junctions of detected CNVs in 89 samples, which were selected by correlating predicted hotspot genes with a database of approximately 54,000 chromosomal microarrays (CMAs). We achieved an 87% positive predictive value overall.
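The pair-selection criteria in this caption (same orientation, spanning at least one exon, <250 kb apart) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Interval` record, the function names, and the coordinates are hypothetical, distance is assumed to be the gap between the two elements, and "spans an exon" is assumed to mean that an exon lies wholly between the two Alus.

```python
from dataclasses import dataclass
from itertools import combinations

MAX_DISTANCE = 250_000  # the "<250 kb" cutoff stated in the caption


@dataclass(frozen=True)
class Interval:
    """Hypothetical record for an Alu element or an exon."""
    chrom: str
    start: int
    end: int
    strand: str = "+"


def spans_exon(left, right, exons):
    """Assumed definition: some exon lies wholly between the two Alus."""
    return any(
        ex.chrom == left.chrom and left.end <= ex.start and ex.end <= right.start
        for ex in exons
    )


def candidate_pairs(alus, exons, max_distance=MAX_DISTANCE):
    """Return Alu pairs that pass the three filters named in the caption."""
    ordered = sorted(alus, key=lambda a: (a.chrom, a.start))
    pairs = []
    for left, right in combinations(ordered, 2):
        # Same chromosome and same orientation.
        if left.chrom != right.chrom or left.strand != right.strand:
            continue
        # Gap between the two elements must be under the cutoff.
        if right.start - left.end >= max_distance:
            continue
        if spans_exon(left, right, exons):
            pairs.append((left, right))
    return pairs


# Hypothetical coordinates, for illustration only.
alus = [
    Interval("chr1", 100, 400, "+"),
    Interval("chr1", 5000, 5300, "+"),
    Interval("chr1", 9000, 9300, "-"),
    Interval("chr1", 400000, 400300, "+"),
    Interval("chr2", 100, 400, "+"),
]
exons = [Interval("chr1", 1000, 1200)]
pairs = candidate_pairs(alus, exons)
```

The all-against-all loop is quadratic and only suits small examples. At the scale of roughly 1.2 million Alus, a sorted sweep or an interval index would be needed.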
Figure 3.
Features of CNV-Alu pairs and microhomology preferences. (A) The relative frequency of Alu subfamilies is shown. For example, the AluS-AluY indicates CNVs mediated by Alus from family AluS and AluY respectively, and “Other” indicates monomeric Alus such as FRAMs. We compared the relative frequency of a given subfamily composition of CNV-Alu pairs (in maroon) with that of the expected relative frequency of observing a given subfamily pair (in blue) using the one-tailed binomial test. (**) P ≤ 0.01; (***) P ≤ 0.001. (B) The histogram describes the distribution of microhomology length at breakpoint junctions. (C) The histogram indicates the %GC content within the stretch of microhomology. (D) The figure depicts the collected 219 microhomologies from disease-related studies in human with respect to their relative position on an Alu consensus sequence (lower panel). The peak in the histogram indicates an enrichment of breakpoint junctions on the specific locus. The light blue shading shows a 26-bp core sequence detected by a previous compilation study of Alu-involved gene rearrangements (Rudiger et al. 1995). (E) Adapted from a comparative genomic study on chimpanzee and human reference genome (Han et al. 2007). The blue line describes 492 human-specific breakpoint junctions of Alu/Alu-mediated deletions, and the red line depicts 663 chimpanzee-specific events. The dashed horizontal line indicates the average percentage of breakpoints across the entire Alu element. (F) The schematic shows the construct utilized to detect template switches in yeast. Two human Alu pairs were inserted into Chr II separately with the same distal AluSx element. URA3 and TRP1 are the markers for selecting colonies with successful transformation. We induced a single-strand DNA break at the FRT site using a mutation of FLP recombinase. (G,H) The relative positions of microhomologies generated by mapping junctions from the yeast assay are depicted in relation to an Alu consensus sequence. 
(G) Data from 503 AAMR events observed in the first AluSx-AluSp strain. (H) Distribution of 114 events from the AluSx-AluY construct.
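The one-tailed binomial test mentioned in panel A compares how often a subfamily pairing is observed among CNV-Alu pairs with its expected frequency. The following standard-library Python sketch shows the upper-tail calculation. The counts and the expected frequency are hypothetical placeholders, not values from the study.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): a one-tailed P-value for enrichment."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers for illustration only (not taken from the study).
observed_pairs = 30   # CNV-Alu pairs with a given subfamily composition
total_pairs = 100     # all CNV-Alu pairs considered
expected_freq = 0.2   # expected relative frequency of that composition

p_value = binomial_upper_tail(observed_pairs, total_pairs, expected_freq)
print(f"one-tailed P = {p_value:.3g}")
```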
Figure 4.
Determining feature enrichment for CNV-Alu pairs with respect to Ctrl-Alu pairs. (A) The comparison of pairwise alignments between CNV-Alus (n = 219) and the corresponding Ctrl-Alu pairs (n = 1000 per CNV-Alu) is shown. The y-axis is a score showing the alignment performance, a higher value of which indicates a better alignment between two sequences. As shown in the key, at each locus, we displayed the alignment score of CNV-Alu with a red dot and showed the distribution of the Ctrl-Alus with a boxplot. The information of all the 219 events is summarized in an increasing order of the median value of the Ctrl-Alus. (B) The distribution of P-values calculated using Monte Carlo simulation for pairwise alignment is shown. (C,D) The same strategy was adopted for analyzing the mean value of the maximum matching score of the PRDM9 targeting motif within an Alu pair.
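The Monte Carlo comparison in panels A and B (one CNV-Alu pair against 1000 Ctrl-Alu pairs per locus) amounts to an empirical P-value. The sketch below uses the common add-one estimator. That choice is an assumption, since the caption does not state the exact formula, and the function name and simulated control scores are hypothetical.

```python
import random


def empirical_p_value(observed, control_scores):
    """Fraction of control scores at least as high as the observed score.

    The add-one correction (an assumption, not stated in the caption)
    keeps the estimate above zero for a finite number of controls.
    """
    n_at_least = sum(1 for s in control_scores if s >= observed)
    return (n_at_least + 1) / (len(control_scores) + 1)


# Simulated alignment scores for 1000 control pairs, for illustration only.
rng = random.Random(0)
control = [rng.gauss(100.0, 15.0) for _ in range(1000)]
p = empirical_p_value(140.0, control)
print(p)
```

With 1000 controls per locus, the smallest P-value this estimator can return is 1/1001.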
Figure 5.
Comparing and selecting machine learning models and the result of a gene-level prediction. (A) The measurement of feature codependency in model training. We tested the error rate for models trained with all selected features (Table 1) as well as by removing one feature at a time (see Methods). (B) The frequency distribution of the gene-level AAMR risk scores for 12,074 OMIM genes. (C) The frequency distribution of the gene-level AAMR risk scores for 133 genes that have been involved in AAMR more than once.
Figure 6.
Experimental and computational validation of AAMR hotspot prediction. (A) High-density aCGH results from one individual selected from the CMA database show a duplication of the two terminal exons of CLIP1, a predicted AAMR hotspot gene. Red dots signify probes that indicate relative copy number gain (the region indicated contains a duplication); black dots, a region unaffected by CNV; and green dots, deletion. (B) The UCSC Genome Browser image depicts RefSeq genes and RepeatMasker annotations within the same genomic interval as shown in the aCGH result. The red block represents the duplicated region. The two SINE elements, AluSc8 and AluSx, in which the breakpoints of this CNV are located are marked with red arrows. (C) The first line of sequence shows the reference sequence of the AluSx; the middle line, the sample sequence; and the bottom line, the sequence of the AluSc8. The sequences are on the plus strand, and both Alus are in the plus orientation. The sequence of microhomology at the breakpoint junction is highlighted in red. The gray sequence starts from the first mismatching base. The genomic coordinates of the microhomologies are annotated in the hg19 assembly. (D) A chart summarizing 52 breakpoint junctions mapped at nucleotide level is depicted. The CNVs are grouped into three types: Alu-Alu, CNVs mediated by an Alu pair; Alu-Other, CNVs mediated by an Alu pairing with a non-Alu sequence, including LINE, LCR, and nonrepeat/repetitive sequence; and Other, CNVs in which no Alu elements were involved. For those mediated by an Alu pair, the QDA prediction result is shown to the right. True prediction indicates these Alu pairs were predicted as high risk for AAMR. (E) A box plot shows the enrichment of genes within different risk score tertiles among three classes of the count of susceptible AAMR CNVs in the CMA database.
