Nucleic Acids Res. 42 (18), e141

leeHom: Adaptor Trimming and Merging for Illumina Sequencing Reads


Gabriel Renaud et al. Nucleic Acids Res.


The sequencing of libraries containing molecules shorter than the read length, such as in ancient or forensic applications, may result in the production of reads that include the adaptor, and in paired reads that overlap one another. Challenges for the processing of such reads are the accurate identification of the adaptor sequence and accurate reconstruction of the original sequence most likely to have given rise to the observed read(s). We introduce an algorithm that removes the adaptors and reconstructs the original DNA sequences using a Bayesian maximum a posteriori probability approach. Our algorithm is faster, and provides a more accurate reconstruction of the original sequence for both simulated and ancient DNA data sets, than other approaches. leeHom is released under the GPLv3 and is freely available from:
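To make the idea of a maximum a posteriori reconstruction concrete, the following is a minimal sketch of how two overlapping base calls could be combined into one consensus base. It is not leeHom's actual implementation: the uniform prior over the four bases, the assumption that an error is equally likely to produce any of the three other bases, and the function names (`phred_to_error`, `map_consensus`) are all assumptions made for illustration.

```python
BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10)

def base_likelihood(observed, true_base, q):
    """P(observed | true_base), assuming errors are spread evenly over the 3 other bases."""
    e = phred_to_error(q)
    return 1 - e if observed == true_base else e / 3

def map_consensus(obs1, q1, obs2, q2):
    """Return the most probable original base and its posterior probability.

    Assumes a uniform prior over A, C, G, T and independent sequencing errors
    in the two reads (illustrative assumptions, not taken from the paper).
    """
    joint = {
        b: base_likelihood(obs1, b, q1) * base_likelihood(obs2, b, q2)
        for b in BASES
    }
    total = sum(joint.values())
    posterior = {b: p / total for b, p in joint.items()}
    best = max(BASES, key=lambda b: posterior[b])
    return best, posterior[best]
```

Under these assumptions, two agreeing calls reinforce each other, while a disagreement is resolved in favour of the call with the higher quality score, with a correspondingly reduced posterior probability.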


Figure 1.
Schematic representation of paired-end sequencing for very short molecules. (a) When the molecule is shorter than the read length, both reads will run into the adaptors and the remaining part will completely overlap. (b) If the molecule is longer than the read length but not longer than twice the read length, adaptor sequences will be absent but a partial overlap can be observed between the ends of the reads.
Figure 2.
Empirical (black) and theoretical (red) length distributions of ancient and modern DNA libraries. Presented is the output of the maximum likelihood fit from the Fitdistrplus R package using a log-normal distribution for an aDNA library (left) and a modern DNA library (right). aDNA molecules tend to be shorter, with a much smaller variance in length, than modern DNA molecules.
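As a rough illustration of the kind of fit described in the Figure 2 caption, the sketch below computes the maximum likelihood estimates of a log-normal distribution in closed form: the mean and the (population) standard deviation of the log-transformed lengths. It is written in Python rather than R and does not reproduce the Fitdistrplus package; the function names `fit_lognormal` and `lognormal_pdf` are hypothetical.

```python
import math

def fit_lognormal(lengths):
    """Closed-form maximum likelihood fit of a log-normal distribution.

    Returns (mu, sigma): the mean and standard deviation of log(length).
    The MLE uses the 1/n (population) variance, not the 1/(n-1) estimate.
    """
    logs = [math.log(x) for x in lengths]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def lognormal_pdf(x, mu, sigma):
    """Density of the fitted log-normal distribution at fragment length x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))
```

A fitted density of this kind could serve as a prior over the length of the original molecule; whether and how leeHom uses it is not stated in the text shown here.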
Figure 3.
The log-likelihood for various possibilities of length of the original molecules for an ancient and modern DNA read pair. The dotted line represents the likelihood that the reads do not merge and that they came from a molecule of length greater than the longest possible overlap. For the aDNA read pairs, a particular length of the original molecule is more likely than the remaining possibilities. This is not the case for modern DNA read pairs due to the longer length of the original molecule.
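The comparison in the Figure 3 caption can be sketched as a scan over candidate molecule lengths, scoring each candidate by how well the two reads agree in the overlap it implies. This is a simplified illustration, not leeHom's model: it assumes both reads have the same length, that the second read has already been reverse-complemented, that the molecule is at least as long as one read (so no adaptor is present), a uniform base composition, and a flat prior over lengths. All names are hypothetical.

```python
import math

BASES = "ACGT"

def pair_log_likelihood(o1, q1, o2, q2):
    """log P(o1, o2) when both reads observe the same unknown base."""
    e1, e2 = 10 ** (-q1 / 10), 10 ** (-q2 / 10)
    total = 0.0
    for b in BASES:
        p1 = 1 - e1 if o1 == b else e1 / 3
        p2 = 1 - e2 if o2 == b else e2 / 3
        total += 0.25 * p1 * p2  # uniform prior over the true base
    return math.log(total)

def length_log_likelihoods(read1, qual1, read2_rc, qual2_rc):
    """Log-likelihood of each candidate molecule length, plus an 'unmerged' model.

    read2_rc is the reverse complement of the second read, so both reads are
    on the same strand. Candidate lengths run from n to 2n - 1 (partial overlap).
    """
    n = len(read1)
    single = math.log(0.25)  # a base seen by one read only
    scores = {"unmerged": 2 * n * single}  # no overlap: all bases independent
    for length in range(n, 2 * n):
        start = length - n      # position on the molecule where read2_rc begins
        overlap = n - start     # number of positions covered by both reads
        ll = 2 * (n - overlap) * single
        for i in range(overlap):
            ll += pair_log_likelihood(
                read1[start + i], qual1[start + i], read2_rc[i], qual2_rc[i]
            )
        scores[length] = ll
    return scores

# Toy example: a 14-base molecule sequenced with two 10-base reads.
scores = length_log_likelihoods(
    "ACGTTGCAAG", [30] * 10, "TGCAAGCTTA", [30] * 10
)
best = max(scores, key=scores.get)
```

In this toy example only the candidate length of 14 yields an overlap without mismatches, so it scores higher than every other candidate and higher than the unmerged model, which mirrors the single dominant peak described for aDNA read pairs.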
Figure 4.
Fraction of all input reads reconstructed by leeHom and by currently available software for sequence reconstruction from paired-end read data, as a function of simulated error rate. Shown are the perfectly reconstructed sequences (left), those with a single mismatch (mm) to the original sequence (center) and those with the correct length (right). Both in terms of perfectly reconstructed sequences and in terms of sequences with the correct length, leeHom outperforms the other currently available algorithms.


