Mob DNA. 10:31. eCollection.

AluMine: Alignment-Free Method for the Discovery of Polymorphic Alu Element Insertions

Tarmo Puurand et al. Mob DNA.

Abstract

Background: Recently, alignment-free sequence analysis methods have gained popularity in the field of personal genomics. These methods are based on counting frequencies of short k-mer sequences, thus allowing faster and more robust analysis compared to traditional alignment-based methods.
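As an illustration of the k-mer counting idea described above, the following minimal sketch tallies every k-mer in a small set of reads. The function name `count_kmers` and the toy reads are assumptions made for this example; production tools use specialised, memory-efficient k-mer counters, and this sketch does not merge reverse-complement k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in an iterable of read sequences.

    Hypothetical helper for illustration only: it treats each strand
    separately and skips k-mers that contain an ambiguous base (N).
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


# Toy input (not real sequencing data).
reads = ["ACGTACGT", "CGTACGTA"]
counts = count_kmers(reads, k=4)
```

Because the counts are gathered directly from reads, no read has to be aligned to a reference before frequencies can be compared.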

Results: We have created a fast alignment-free method, AluMine, to analyze polymorphic insertions of Alu elements in the human genome. We tested the method on 2,241 individuals from the Estonian Genome Project and identified 28,962 potential polymorphic Alu element insertions. Each tested individual had on average 1,574 Alu element insertions that were different from those in the reference genome. In addition, we propose an alignment-free genotyping method that uses the frequency of insertion/deletion-specific 32-mer pairs to call the genotype directly from raw sequencing reads. Using this method, the concordance between the predicted and experimentally observed genotypes was 98.7%. The running time of the discovery pipeline is approximately 2 h per individual. The genotyping of potential polymorphic insertions takes between 0.4 and 4 h per individual, depending on the hardware configuration.
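The abstract states that genotypes are called from the frequencies of a pair of insertion/deletion-specific 32-mers, but it does not describe the calling model. The sketch below is therefore only a naive stand-in that thresholds the fraction of allele B counts; the function name `call_genotype`, the cut-offs, and the minimum count are all assumptions for illustration and are not taken from AluMine.

```python
def call_genotype(count_a, count_b, min_total=4, het_low=0.2, het_high=0.8):
    """Call a diploid genotype from the counts of a 32-mer pair.

    count_a: occurrences of the reference-allele (A) 32-mer in the reads.
    count_b: occurrences of the alternative-allele (B) 32-mer in the reads.
    All thresholds are illustrative assumptions, not published values.
    """
    total = count_a + count_b
    if total < min_total:
        return "NC"  # no call: too few supporting k-mers
    frac_b = count_b / total
    if frac_b < het_low:
        return "AA"  # homozygous for the reference allele
    if frac_b > het_high:
        return "BB"  # homozygous for the alternative allele
    return "AB"      # heterozygous


# Toy counts (not real data).
examples = {
    "ref_hom": call_genotype(30, 0),
    "het": call_genotype(14, 16),
    "alt_hom": call_genotype(0, 28),
    "low_cov": call_genotype(1, 1),
}
```

A real caller would typically model sequencing depth and error rates rather than rely on fixed ratio cut-offs.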

Conclusions: AluMine provides tools that allow discovery of novel Alu element insertions and/or genotyping of known Alu element insertions from personal genomes within a few hours.

Keywords: Alignment-free sequence analysis; Alu repeat element; Mobile element insertions.

Conflict of interest statement

Competing interests: The authors declare that they have no competing interests.

Figures

Fig. 1
Principle of creating k-mer pairs for the calling (genotyping) of polymorphic Alu element insertions. (a) Genomic regions with or without an Alu element. (b) A pair of 32-mers is created from the insertion breakpoint region covering 25 nucleotides from the 5′-flanking region and 7 nucleotides from either the Alu element or the 3′-flanking region. Allele A always represents the sequence from the reference genome and allele B represents the alternative, non-reference allele
Fig. 2
Overview of the discovery methods. Potential polymorphic Alu elements were identified from the raw reads of high-coverage WGS data (REF– Alu elements) and the reference genome (REF+ Alu elements). The candidate Alu elements were filtered using a subset of high-coverage individuals. A final set of 32-mers was used for the fast calling of polymorphic insertions from raw sequencing reads
Fig. 3
(a) The number of discovered REF– Alu elements in individual NA12877 depending on the depth of coverage. Various depth-of-coverage levels were generated by randomly selecting a subset of reads from the FASTQ file. (b) The frequency of false-negative Alu elements found in simulations. FN1 denotes false negatives that could not be detected because they are inserted in nonunique regions of the genome. FN2 denotes false negatives that could not be detected because they are inserted within unsequenced regions of the genome (N-rich regions). Error bars indicate 95% confidence intervals from 20 replicates
Fig. 4
Overlap between REF+ and REF– elements detected by different methods from an individual NA12878. The Venn diagram was created with BioVenn software [42]
Fig. 5
Histogram showing the distribution of the number of non-reference REF– (light) and REF+ (dark) elements discovered per individual genome in 2,241 test individuals from the Estonian Genome Project
Fig. 6
Cumulative frequency of REF– Alu elements discovered from studied individuals
Fig. 7
A gel electrophoretic image showing the experimental validation of polymorphic Alu element insertion (REF– elements). One polymorphic Alu element from chr8:42039896 was tested by PCR in DNA from 61 individuals. Lower bands show the absence of an Alu insertion (reference allele A), and upper bands show its presence (alternative allele B)
Fig. 8
A gel electrophoretic image showing the experimental validation of REF+ polymorphic Alu element insertions. Three locations from chr1:169160349, chr15:69049897 and chr3:95116523 were tested by PCR in DNA from 61 individuals. Upper bands show the presence of an Alu insertion (reference allele A), and lower bands show its absence (alternative allele B)
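The 32-mer pair construction described in the Fig. 1 caption (25 nucleotides of 5′ flank followed by 7 nucleotides from either the Alu element or the 3′ flank) can be sketched as follows. The function name `make_kmer_pair`, its arguments, and the toy sequences are assumptions for this example; the sketch ignores details such as target site duplications and Alu orientation.

```python
FLANK_5P_LEN = 25  # nucleotides taken from the 5'-flanking region
TAIL_LEN = 7       # nucleotides taken from the Alu element or the 3' flank


def make_kmer_pair(flank_5p, alu, flank_3p, reference_has_alu):
    """Return (allele_a, allele_b) 32-mers for one insertion breakpoint.

    Allele A is always the reference allele, so the order depends on
    whether the reference genome contains the Alu (REF+) or not (REF-).
    """
    if len(flank_5p) < FLANK_5P_LEN or len(alu) < TAIL_LEN or len(flank_3p) < TAIL_LEN:
        raise ValueError("input sequences are too short to build 32-mers")
    anchor = flank_5p[-FLANK_5P_LEN:]        # last 25 nt before the breakpoint
    with_alu = anchor + alu[:TAIL_LEN]       # breakpoint continues into the Alu
    without_alu = anchor + flank_3p[:TAIL_LEN]  # breakpoint continues into the 3' flank
    if reference_has_alu:
        return with_alu, without_alu
    return without_alu, with_alu


# Toy sequences (not real genomic coordinates).
flank_5p = "TTTTT" + "ACGT" * 6 + "A"
alu = "GGCCGGGCGCGGTGGCTCA"
flank_3p = "TTGACCATTT"

ref_minus_pair = make_kmer_pair(flank_5p, alu, flank_3p, reference_has_alu=False)
ref_plus_pair = make_kmer_pair(flank_5p, alu, flank_3p, reference_has_alu=True)
```

Both 32-mers share the same 25-nucleotide prefix and differ only in the final 7 nucleotides, which is what lets their counts distinguish the two alleles.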

References

    1. de Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7:e1002384. doi: 10.1371/journal.pgen.1002384.
    2. Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 2013;41:D70–D82. doi: 10.1093/nar/gks1265.
    3. Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016;44:D81–D89. doi: 10.1093/nar/gkv1272.
    4. Tang W, Mun S, Joshi A, Han K, Liang P. Mobile elements contribute to the uniqueness of human genome with 15,000 human-specific insertions and 14 Mbp sequence increase. DNA Res. 2018;25:521–533. doi: 10.1093/dnares/dsy022.
    5. Houck CM, Rinehart FP, Schmid CW. A ubiquitous family of repeated DNA sequences in the human genome. J Mol Biol. 1979;132:289–306. doi: 10.1016/0022-2836(79)90261-4.