Proc Natl Acad Sci U S A. 2017 Jan 31;114(5):E733-E740.
doi: 10.1073/pnas.1619797114. Epub 2017 Jan 17.

Human Transposon Insertion Profiling: Analysis, Visualization and Identification of Somatic LINE-1 Insertions in Ovarian Cancer

Free PMC article

Zuojian Tang et al. Proc Natl Acad Sci U S A.

Abstract

Mammalian genomes are replete with interspersed repeats reflecting the activity of transposable elements. These mobile DNAs are self-propagating, and their continued transposition is a source of both heritable structural variation and somatic mutation in human genomes. Tailored approaches to map these sequences are useful to identify insertion alleles. Here, we describe in detail transposon insertion profiling by next-generation sequencing (TIPseq), a strategy to selectively amplify and sequence long interspersed element-1 (LINE-1, L1) retrotransposon insertions in the human genome. We also report the development of a machine-learning-based computational pipeline, TIPseqHunter, to identify insertion sites with high precision and reliability. We demonstrate the utility of this approach to detect somatic retrotransposition events in high-grade ovarian serous carcinoma.

Keywords: LINE-1; TIPseq; human; ovarian cancer; retrotransposon.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
LINE-1 insertions and vectorette PCR. (A) Full-length LINE-1 (L1) insertion is diagrammed; the LINE-1 spans 6 kb and includes two ORFs (ORF1 and ORF2). The element ends with a 3′ series of adenine nucleobases and the pA tail of its RNA precursor, and it is flanked by target site duplications (TSDs) of the preinsertion genomic sequence (red boxes). (B, Top to Bottom) Vectorette PCR workflow. Genomic DNA (parallel lines) is cut with restriction enzymes, leaving sticky ends (downward-facing arrows); vectorette adapters (blue) are ligated to these ends. The annealed vectorette sequences are not perfectly complementary, and no binding site exists for the amplification primer at the outset of the PCR assay. First-strand extensions (sea green) occur from a forward primer specific for L1Hs LINE-1 (black, rightward-facing top arrow), and in subsequent iterations of the PCR, the reverse amplification primer has its complement from these strands (black, leftward-facing bottom arrow). The structure of the resulting amplicons, along with possibilities for corresponding paired end sequencing reads, is shown. Informative reads can be grouped into categories depending on their positions in the TIPseq amplicons: L1/junction, L1/genome, junction/genome, and genome/genome read pairs. L1/L1 concordant read pairs are not informative.
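The read-pair categories named in this caption can be illustrated with a minimal sketch. It assumes each mate has already been labeled as "L1", "junction", or "genome" by an upstream step; the labels, the `classify_pair` function, and the handling of unlisted combinations are hypothetical and are not taken from TIPseqHunter.

```python
from collections import Counter

# Categories listed in the Fig. 1 caption, keyed by the unordered set of
# mate labels. The label strings themselves are an assumption of this sketch.
INFORMATIVE = {
    frozenset(["L1", "junction"]): "L1/junction",
    frozenset(["L1", "genome"]): "L1/genome",
    frozenset(["junction", "genome"]): "junction/genome",
    frozenset(["genome"]): "genome/genome",
}


def classify_pair(mate1, mate2):
    """Return the category of a read pair, or None if it is not listed.

    L1/L1 pairs are uninformative per the caption. Any other combination
    the caption does not name (e.g. junction/junction) also returns None
    here, which is a choice made for this sketch only.
    """
    return INFORMATIVE.get(frozenset([mate1, mate2]))


# Hypothetical mate labels for five read pairs.
pairs = [
    ("L1", "junction"),
    ("genome", "L1"),
    ("L1", "L1"),
    ("genome", "genome"),
    ("junction", "genome"),
]
counts = Counter(classify_pair(a, b) for a, b in pairs)
```

Keying on a `frozenset` makes the lookup independent of mate order, so ("genome", "L1") and ("L1", "genome") fall into the same category.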
Fig. 2.
Schematic of the TIPseqHunter pipeline. There are five steps in the pipeline: (i) low-quality sequences, base pairs, and vectorette sequences are trimmed using Trimmomatic software; (ii) qualified read pairs are aligned to an L1Hs masked reference genome (hg19) and the L1Hs consensus sequence using Bowtie2 software; (iii) candidate insertion sites are identified using the enriched target sites with at least one junction-containing read pair; (iv) a machine-learning model is built using five features (width, depth, variant index, pA tail purity, and number of junction reads); and (v) the trained model is used to predict probabilities of the candidate insertions being the true insertion sites.
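Step (iii) of the pipeline, keeping enriched target sites that have at least one junction-containing read pair, might be sketched as follows. The `TargetSite` record, its field names, and the example coordinates are hypothetical; the actual TIPseqHunter data structures are not described in this caption.

```python
from dataclasses import dataclass


@dataclass
class TargetSite:
    """Hypothetical record for one enriched target region."""
    chrom: str
    start: int
    end: int
    junction_read_pairs: int  # read pairs containing an L1/genome junction


def candidate_sites(sites, min_junction_pairs=1):
    """Keep sites supported by at least `min_junction_pairs` junction pairs."""
    return [s for s in sites if s.junction_read_pairs >= min_junction_pairs]


# Made-up example regions, for illustration only.
sites = [
    TargetSite("chr1", 1000, 1400, junction_read_pairs=3),
    TargetSite("chr2", 5000, 5200, junction_read_pairs=0),
    TargetSite("chr3", 9000, 9350, junction_read_pairs=1),
]
candidates = candidate_sites(sites)
```

The default threshold of 1 mirrors the "at least one" wording in the caption; exposing it as a parameter is a convenience of this sketch.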
Fig. 3.
Training and evaluation of the model. (1) To identify if a candidate insertion site is a true insertion site, a dataset labeled with true and false insertion sites (the labeled set) is constructed for each sequencing sample. Positive instances (true insertion sites) in the labeled set are identified by matching to one of the two annotated LINE-1 lists (fixed present and RepeatMasker). (2) Negative instances (false insertion sites) in the labeled set are defined as candidates missing the first 5′-most position of the L1 amplification primer. (3) Labeled set comprises positive instances and negative instances, with “1” representing a positive instance and “0” representing a negative instance. Five selected features extracted from the sequencing data are obtained for each instance and will be used to construct the predictive model. (4) Labeled set is split into a training set (70%) and test set (30%) randomly. (5) Predictive model is built with logistic regression on the training set to establish the relationship between the characteristics in sequencing data and the instance type (i.e., whether a candidate insertion site is a true insertion site). (6) Resulting predictive model is applied to the test set to predict the instance type. (7) Instance type predicted by the model is compared with the true instance type of the test set to evaluate the performance of the predictive model (measured by accuracy). If the model performance is not satisfactory on the test set, applying the predictive model on the unlabeled dataset for novel insertion set discovery is not recommended. (8) Predictive model is applied to the unlabeled set to predict the probability of a candidate insertion site being a true transposon insertion site.
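Steps (4) through (7) can be illustrated with a small, self-contained sketch: a random 70/30 split, a logistic regression fit, and an accuracy check on the held-out set. The caption specifies logistic regression, the split ratio, and accuracy as the metric; everything else here (batch gradient descent, the learning rate, the epoch count, and the synthetic five-feature data) is an assumption for illustration and not the TIPseqHunter implementation.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def split_70_30(X, y, seed=0):
    """Randomly split the labeled set into 70% training and 30% test."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = (7 * len(idx)) // 10
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    """Predicted probability that a candidate is a true insertion (label 1)."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def accuracy(model, X, y):
    """Fraction of instances whose predicted label (P >= 0.5) matches y."""
    hits = sum((predict_proba(model, xi) >= 0.5) == bool(yi)
               for xi, yi in zip(X, y))
    return hits / len(y)


# Synthetic labeled set with five features per instance (made-up values).
rng = random.Random(1)
X, y = [], []
for _ in range(100):
    X.append([rng.gauss(2.0, 0.5) for _ in range(5)])
    y.append(1)  # positive instance
    X.append([rng.gauss(-2.0, 0.5) for _ in range(5)])
    y.append(0)  # negative instance

X_train, y_train, X_test, y_test = split_70_30(X, y)
model = train_logistic(X_train, y_train)
test_accuracy = accuracy(model, X_test, y_test)
```

In line with step (7), a low `test_accuracy` would be a signal not to apply `predict_proba` to the unlabeled candidates.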
Fig. 4.
Model parameters. The distribution of five model parameters (width and depth in base pairs of the enriched target region; variant index, alignment mismatches and indels; pA tail purity at each predicted LINE-1 3′ end; and the number of junction reads supporting the predicted site). Width, depth, and junction reads are all log2-based values. Width and depth determine the placement of each point on the x and y axes. The variant index is shown as the color of the data point fill. The pA purity is shown as the color of the data point outline. The number of junction reads is depicted as the size of the data point. (A) Negative instances (Left), positive instances (Center), and unlabeled instances (Right) when a fixed present set of L1PA1 is used to train the model. (B) Predicted probabilities that candidate insertions are true LINE-1 insertions in five increments of P = 0.02, and then P < 0.9 (rightmost). (C) Instances when the RepeatMasker set is used to train the model. More insertions are included as positive instances for training compared with A. (D) Predicted probabilities associated with C.
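The visual encoding described in this caption can be summarized as a mapping from the five features to plot attributes. The sketch below only restates that mapping in code; the function name, argument names, and example values are hypothetical, and no plotting library is assumed.

```python
import math


def plot_encoding(width_bp, depth, variant_index, pa_purity, junction_reads):
    """Map one candidate site's features to the plot attributes in Fig. 4.

    Width, depth, and junction reads are log2-transformed, as stated in
    the caption. Junction reads are assumed to be >= 1, since candidates
    require at least one junction-containing read pair.
    """
    return {
        "x": math.log2(width_bp),           # width on the x axis
        "y": math.log2(depth),              # depth on the y axis
        "fill": variant_index,              # color of the point fill
        "outline": pa_purity,               # color of the point outline
        "size": math.log2(junction_reads),  # size of the point
    }


# Made-up feature values for a single candidate site.
point = plot_encoding(width_bp=256, depth=64, variant_index=0.02,
                      pa_purity=0.95, junction_reads=8)
```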
Fig. 5.
Model performance. (A) Accuracies for a set of matched germ-line, primary pancreatic tumor, and metastatic tumor samples. Accuracies are the highest when the fixed present L1PA1 set is used to define positive instances for training models (circles). Accuracies for the RepeatMasker trained models range from 0.90 to 0.98 (squares). (B) Number of insertions detected (labeled positive plus unlabeled instances) with P > 0.99 as predicted by the models when training on the fixed present and RepeatMasker sets.
Fig. S1.
Percentage of uniquely mapped reads (blue) and concordant aligned reads (red) in each target region for fixed present insertions. Loci are sorted left-to-right in order of the percentage of unique mapped reads for the corresponding target region. Where mappability of reads is low (left-hand side), there are comparable proportions of concordant read pairs as where mappability of reads is high (right-hand side). Overall, the low percentage of concordant read pairs stems from our masking the reference copies of L1Hs. A read overlaying the 3′ of the LINE-1 insertion will not be concordantly mapped with its mate that maps to adjacent genomic sequence.
Fig. S2.
Percentage of insertions detected as a function of the number of read pairs sequenced (P ≥ 0.5; n = 44 somatically acquired, PCR-validated LINE-1 insertions). The detection rate is reduced and becomes unstable when the sequencing reads are decreased; in this sample, the effect is pronounced at <10 million reads. The effect is greater when a smaller set of 200 fixed present L1Hs insertions are used to build the model (Top) as opposed to when a larger set of 1,544 RepeatMasker annotated insertions is used (Bottom).
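The quantity plotted here, the percentage of validated insertions called at P ≥ 0.5 for a given sequencing depth, can be computed as in the sketch below. The function name and the example probabilities are hypothetical; they are not results from the paper.

```python
def detection_rate(probabilities, threshold=0.5):
    """Percentage of validated insertions with predicted P >= threshold."""
    detected = sum(p >= threshold for p in probabilities)
    return 100.0 * detected / len(probabilities)


# Made-up model probabilities for four validated insertions at two
# hypothetical subsampling depths (in read pairs).
probs_by_depth = {
    5_000_000: [0.9, 0.5, 0.4, 0.1],
    20_000_000: [0.99, 0.95, 0.8, 0.3],
}
rates = {depth: detection_rate(p) for depth, p in probs_by_depth.items()}
```

Repeating this calculation across progressively smaller random subsamples of the read pairs would give a curve of the kind the caption describes.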
Fig. 6.
Somatic LINE-1 insertions in ovarian cancer. (A) Positions of somatic insertions observed in HGSC shown on a chromosomal ideogram (red marks). To the lower right, a schematic shows the structure of the BRCA1 gene at 17q21.31 and the location of a somatically acquired, intronic LINE-1 insertion. The 593-bp LINE-1 is 5′ truncated and includes a portion of the ORF2p ORF, the LINE-1 3′ UTR, and a pA tail (red); it is flanked by TSDs (white boxes). (B–E) TranspoScope view of the evidence for two insertions. (B and C) L1(Ta) at chr6:136,712,694 ± 3 at two different magnifications. (D and E) LINE-1 insertion at chr17:41,250,393 ± 1 in BRCA1. (B and D) Distribution of genome/genome read pairs (gray); genome/L1 read pairs (purple/blue), junction reads (orange), and all reads overlaid. (C and E) Sequence of the junction reads.
