In a sequence homology search, the list of functions found, together with the counts of reads aligned to each, constitutes the functional profile of a metagenomic sample. However, the short read lengths associated with many next-generation sequencing technologies pose significant obstacles to this approach: artificial families, cross-annotations, length bias and conservation bias. Widely applied cutoff methods, such as thresholds on the BLAST E-value, cannot solve these problems. Building on published procedures that successfully address the artificial-family and cross-annotation issues, we propose zero-truncated Poisson and Binomial (ZTP-Bin) hierarchical modelling to correct the length bias and the conservation bias. Goodness-of-fit of the model and cross-validation of its predictions on a bioinformatically simulated sample support the validity of this approach. Evaluated on an in vitro-simulated data set, the proposed modelling method outperforms traditional methods. All three steps were then applied sequentially to real metagenomic samples to show that the proposed framework leads to a more accurate functional profile of a short-read metagenomic sample.
Keywords: conservation bias; functional profiling; length bias; metagenomics; short reads. MSC: primary 62F10; secondary 62P10.