A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads

Le Van Vinh; Tran Van Lang; Le Thanh Binh; Tran Van Hoai

doi:10.1186/s13015-014-0030-4

A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads

Algorithms Mol Biol. 2015 Jan 16;10(1):2. doi: 10.1186/s13015-014-0030-4. eCollection 2015.

Authors

Le Van Vinh¹, Tran Van Lang², Le Thanh Binh³, Tran Van Hoai¹

Affiliations

¹ Faculty of Computer Science and Engineering, HCMC University of Technology, 268 Ly Thuong Kiet, Q10, Ho Chi Minh City, Vietnam.
² Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology (VAST), 01 Mac Dinh Chi, Q1, Ho Chi Minh City, Vietnam ; Faculty of Information Technology, Lac Hong University, 10 Huynh Van Nghe, Bien Hoa, Dong Nai Vietnam.
³ Institute of Biotechnology, Vietnam Academy of Science and Technology (VAST), 18 Hoang Quoc Viet, Cau Giay, Ha Noi Vietnam.

Abstract

Background: Metagenomics is the study of genetic materials derived directly from complex microbial samples, instead of from culture. One of the crucial steps in metagenomic analysis, referred to as "binning", is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from reads, and especially taking advantage in case of the limitation of reference database availability. However, their performance, under various aspects, is still being investigated by recent theoretical and empirical studies. The one addressed in this paper is among those efforts to enhance the accuracy of the classification.

Results: This paper presents an unsupervised algorithm, called BiMeta, for binning of reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase of the algorithm, reads are grouped into groups based on overlap information between the reads. The second phase merges the groups by using an observation on l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets showed that BiMeta outperforms three state-of-the-art binning algorithms for both short and long reads (≥700 b p) datasets.

Conclusions: This paper developed a novel and efficient algorithm for binning of metagenomic reads, which does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm.

Keywords: Algorithm; Binning; Metagenomics; Next-generation sequencing; l-mers frequency.