J Biol Chem. 2013 May 31;288(22):16127-38.
doi: 10.1074/jbc.M113.451500. Epub 2013 Mar 25.

Structure Prediction and Analysis of DNA Transposon and LINE Retrotransposon Proteins

Free PMC article

György Abrusán et al. J Biol Chem.

Abstract

Despite the considerable amount of research on transposable elements, no large-scale structural analyses of the TE proteome have been performed so far. We predicted the structures of hundreds of proteins from a representative set of DNA and LINE transposable elements and used the obtained structural data to provide the first general structural characterization of TE proteins and to estimate the frequency of TE domestication and horizontal transfer events. We show that 1) ORF1 and Gag proteins of retrotransposons contain high amounts of structural disorder; thus, despite their very low conservation, the presence of disordered regions and probably their chaperone function is conserved. 2) The distribution of SCOP classes in DNA transposons and LINEs indicates that the proteins of DNA transposons are more ancient, containing folds that already existed when the first cellular organisms appeared. 3) DNA transposon proteins have lower contact order than randomly selected reference proteins, indicating rapid folding, most likely to avoid protein aggregation. 4) Structure-based searches for TE homologs indicate that the overall frequency of TE domestication events is low, whereas we found a relatively high number of cases where horizontal transfer, frequently involving parasites, is the most likely explanation for the observed homology.

Keywords: Evolution; Gene Transposable Elements; Horizontal Transfer; Intrinsically Disordered Proteins; Protein Evolution; Protein Folding; RNA World; Transposon Domestication.

Figures

FIGURE 1.
Domain annotation, structure prediction, and quality determination of the protein models using the ORF2 protein of the human L1HS retrotransposon as an example. A, the ORF2 protein of the human L1HS transposon is 1275 amino acids long; thus, a reliable model of the entire protein could not be built with current methods. Using a profile built from SCOP sequences similar to the target protein and annotation of conserved domains, we split the sequence into three regions, corresponding to the functional units of the protein: the endonuclease region, the reverse-transcriptase region, and a cysteine-rich region of unknown function (contains the Pfam conserved domain DUF1725). B, the I-TASSER protein model of the endonuclease region of the protein and the local r.m.s.d. distribution. Because a solved experimental structure for the human L1 endonuclease region is available in PDB, the model of residues 1–235 closely matches it, is characterized by very low r.m.s.d., and is correct (TM score, 0.97; blue), whereas for the remaining 100 amino acids of the region, I-TASSER was not able to build a high quality structure. C, the predicted structure of the reverse-transcriptase region and the distribution of local r.m.s.d. values. The overall quality of the structure is low (estimated TM score, 0.33); however, the r.m.s.d. distribution shows a clear dip at the reverse transcriptase conserved domain, and the structure of this 170-residue region (residues 235–405) is essentially correct (highlighted in blue), with an estimated TM score of 0.59. D, the predicted structure of the region with the cysteine-rich domain and the distribution of the local r.m.s.d. values. The structure has a somewhat better overall quality than the reverse transcriptase region (TM score of 0.41) and can be split into several regions with low local r.m.s.d. (highlighted in red, blue, and green), which improves local TM scores to 0.45, 0.46, and 0.41, respectively.
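For readers unfamiliar with the TM score used in this caption, the following is a minimal sketch of the commonly used definition, in which each aligned residue pair contributes 1/(1 + (d/d0)^2) and d0 depends on the target length. It is not the authors' pipeline: it scores one fixed superposition (the real TM score takes the maximum over superpositions), and the function names and the 0.5 Å floor on d0 are assumptions made for this illustration.

```python
def tm_d0(l_target, floor=0.5):
    """Length-dependent distance scale d0 (in angstroms).

    The floor for very short chains is an assumption of this sketch.
    """
    if l_target <= 15:
        return floor
    return max(1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8, floor)


def tm_score(distances, l_target):
    """TM score for ONE given superposition.

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the target protein, used for normalization.
    Unaligned target residues contribute nothing to the sum.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

A perfectly superimposed model scores 1, and a score above 0.5 is the threshold the captions use for an essentially correct fold.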
FIGURE 2.
Coiled-coil and intrinsically disordered regions in TE proteins. A, the fraction of coiled-coil sequence in different TEs. ORF1 proteins of CR1, L1, and L2 families of LINEs are characterized by much higher amounts of coiled-coil sequence than ORF2 proteins or proteins of LTR retrotransposons. B, coiled-coil regions in the ORF1/Gag proteins are present near the N terminus of the sequence. C, the fraction of disordered sequence predicted with IUPred in different TE protein types. ORF1/Gag proteins are characterized by an ∼5-fold higher amount of disordered sequence than ORF2 proteins of LINEs, LTR polyproteins, or DNA transposases. D, the distribution of disordered regions along the sequence of ORF1 proteins of LINEs and LTR Gag proteins.
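The quantities in panels C and D can be illustrated with a short sketch. It assumes that per-residue disorder scores between 0 and 1 are already available (for example, from an IUPred run) and that residues scoring above 0.5 are counted as disordered, which is the conventional cutoff; the function names and the binning scheme are hypothetical and not taken from the paper.

```python
def disorder_fraction(scores, threshold=0.5):
    """Fraction of residues whose disorder score exceeds the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > threshold) / len(scores)


def disorder_profile(scores, n_bins=10, threshold=0.5):
    """Fraction of disordered residues in each relative-position bin.

    Bin 0 covers the N-terminal end and the last bin the C-terminal end,
    so proteins of different lengths can be compared on one axis.
    Empty bins (chains shorter than n_bins) are reported as 0.0.
    """
    n = len(scores)
    hits = [0] * n_bins
    totals = [0] * n_bins
    for i, s in enumerate(scores):
        b = min(i * n_bins // n, n_bins - 1)
        totals[b] += 1
        if s > threshold:
            hits[b] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]
```

Averaging such profiles over many proteins of one type would give a positional distribution of the kind shown in panel D.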
FIGURE 3.
Distribution of C-scores of the low r.m.s.d. regions of the TE structures. 61% of LINE low r.m.s.d. regions and 39% of DNA transposon low r.m.s.d. regions have a C-score higher than −1.78; thus, their estimated TM score is higher than 0.5.
FIGURE 4.
The SCOP class composition of TE proteins. DNA transposons are characterized mostly by all-α domains and α/β domains, whereas LINEs are characterized mostly by multidomain hits and α+β domains (see also supplemental Table 5).
FIGURE 5.
The proteins of DNA transposons contain more ancient SCOP folds than LINE retrotransposons. The age of protein folds is measured as node distance, a measure based on the phylogenetic spread of the fold; the larger the node distance, the younger the particular fold is, i.e. the more distant from the most ancient protein folds on the phylogenomic tree (see Ref. for details). The histogram shows that in DNA transposons, the most abundant protein folds are among the most ancient ones, which were already present before the appearance of the first cellular organisms (∼4 Bya), whereas the most frequent folds in LINEs were invented later, approximately at the time of the specification of the three superkingdoms (∼3 Bya), suggesting that DNA substituted RNA as the carrier of genetic information already in the early Archean period.
FIGURE 6.
Contact order of TE proteins. A, the contact order of proteins of DNA transposons is significantly lower than the contact order of the reference CASP9 proteins (p ≪ 0.001, ANCOVA), indicating that DNA transposons are under selection to fold rapidly. B, LINEs (non-LTR retrotransposons) do not show the same pattern (p = 0.31, ANCOVA). C, contact order of highly soluble and poorly soluble (prone to aggregation) E. coli proteins and DNA transposons.
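Contact order, as commonly defined, is the average sequence separation between residues that are in contact in the folded structure; the relative form divides this by chain length. The sketch below is a simplification and not the authors' procedure: it uses one coordinate per residue (for example, Cα atoms) and an 8 Å cutoff, both of which are assumptions of this illustration, whereas the original definition counts contacts between all heavy atoms. The function name and parameters are hypothetical.

```python
import math


def contact_order(coords, cutoff=8.0, min_separation=1):
    """Return (absolute, relative) contact order.

    coords: one (x, y, z) tuple per residue, in sequence order.
    cutoff: distance in angstroms below which two residues are in contact.
    min_separation: smallest |i - j| counted as a contact pair.
    """
    n_res = len(coords)
    n_contacts = 0
    total_separation = 0
    for i in range(n_res):
        for j in range(i + min_separation, n_res):
            if math.dist(coords[i], coords[j]) < cutoff:
                n_contacts += 1
                total_separation += j - i
    if n_contacts == 0:
        return 0.0, 0.0
    absolute = total_separation / n_contacts
    return absolute, absolute / n_res
```

Low values mean that most contacts are local in sequence, which is the property the caption associates with rapid folding.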
FIGURE 7.
The contact order of DNA transposon folds is low even within the same superfamilies. A, correlations between length and absolute contact order, for SCOP families present in the high quality DNA transposon structures, and all other families from the same SCOP superfamilies. B, although the difference is small, SCOP families in DNA transposons have significantly lower contact order than other families from the same SCOP superfamilies (ANCOVA, p = 0.02409).
FIGURE 8.
The relationship between the quality of a PFP structure and the likelihood of detecting structural similarity/homology with a TE. A, the probability of detecting homology between the TE structures and the proteome folding project structures depends largely on the quality (MCM score) of PFP decoys. Structures with an MCM score of 0.8 have mostly correct topology (two of three are correct), whereas below an MCM score of 0.4, their quality is low and they are mostly incorrect. The quality of PFP structures has a very large effect on the number of detected homologs, which is the result of two independent processes: 1) the probability of detecting structural similarity (TM score > 0.5) between TE and PFP structures increases radically with the increasing quality (MCM score) of the PFP structure (B); 2) in the identified similar structure pairs, the fraction of pairs with significant sequence similarity (p < 0.001) also increases with the quality of the structures, although less dramatically (∼2.5-fold; C). This has two consequences for the estimation of the false positive rate of homolog detection. First, the fraction of incorrectly detected structure pairs due to modeling errors is probably low: we detect real analogs and homologs (B). However, based on the fraction of cases with significant sequence similarity (p < 0.001) where the structural similarity between a TE and a PFP decoy is likely to be an artifact (C; MCM score of PFP structures < 0.4), the number of homologs is probably overestimated by up to 40%.
