Phylogeny-driven target selection for large-scale genome-sequencing (and other) projects

Stand Genomic Sci. 2013 May 20;8(2):360-74. doi: 10.4056/sigs.3446951. eCollection 2013.

Abstract

Despite the steadily decreasing costs of genome sequencing, prioritizing organisms for sequencing remains important in large-scale projects. Phylogeny-based selection is of interest because it identifies those organisms whose genomes can be expected to differ most from those that have already been sequenced. Here, we describe a method, computationally simple and easy to apply in practice, that infers a phylogenetic scoring independent of which set of organisms has previously been targeted. The scoring itself, as well as pre- and post-processing of the data, is illustrated using two real-world examples in which the method has already been applied for selecting targets for genome sequencing. These projects are the JGI CSP Genomic Encyclopedia of Bacteria and Archaea phase I, targeting 1,000 type strains, and, on a smaller scale, the phylogenomics of the Roseobacter clade. Potential artifacts of the method are discussed and compared to a selection approach based on the taxonomic classification.

Keywords: 16S rRNA; Genomic Encyclopedia; Roseobacter clade; genomics; phylogenetic diversity; taxon selection; tree of life.