Background: One objective of metagenomics is to reconstruct information about specific uncultured organisms from fragmentary environmental DNA sequences. We used the genome of an isolate of the marine alphaproteobacterium SAR11 ('Candidatus Pelagibacter ubique'; strain HTCC1062), obtained from the cold, productive Oregon coast, as a query sequence to study variation in SAR11 metagenome sequence data from the Sargasso Sea, a warm, oligotrophic ocean gyre.
Results: The average amino acid identity of SAR11 genes encoded by the metagenomic data to the query genome was only 71%, indicating significant evolutionary divergence between the coastal isolates and Sargasso Sea populations. However, an analysis of gene neighbors indicated that SAR11 genes in the Sargasso Sea metagenomic data match the gene order of the HTCC1062 genome in 96% of cases (>85,000 observations), and that rearrangements are most frequent at predicted operon boundaries. We found no conserved examples of genes with known functions that were present in the coastal isolates but absent from the Sargasso Sea metagenomic data, or vice versa, suggesting that core regions of these diverse SAR11 genomes are relatively conserved in gene content. However, four hypervariable regions were observed, which may encode properties associated with variation in SAR11 ecotypes. The largest of these, HVR2, is a 48 kb region flanked by the sole 5S and 23S rRNA genes in the HTCC1062 genome, and mainly encodes genes that determine cell surface properties. A comparison of two closely related 'Candidatus Pelagibacter' genomes (HTCC1062 and HTCC1002) revealed a number of "gene indels" in core regions. Most of these were found to be polymorphic in the metagenomic data and showed evidence of purifying selection, suggesting that the same "polymorphic gene indels" are maintained in physically isolated SAR11 populations.
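The gene-neighbor comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each metagenomic fragment has already been reduced to an ordered list of reference-gene indices (the position of the best-matching gene in the query genome), treats the reference genome as circular, and counts a neighboring pair as conserved only when the two indices are adjacent in the reference. The function name `synteny_fraction` and the toy data are hypothetical.

```python
def synteny_fraction(fragments, genome_size):
    """Fraction of adjacent gene pairs on fragments that are also
    adjacent in a circular reference genome of `genome_size` genes.

    `fragments` is a list of lists; each inner list holds the reference
    gene indices (0-based) of consecutive genes on one fragment.
    Assumption: hits were assigned beforehand, e.g. by best match.
    """
    conserved = 0
    total = 0
    for hits in fragments:
        # Walk over each pair of neighboring genes on the fragment.
        for a, b in zip(hits, hits[1:]):
            total += 1
            diff = abs(a - b)
            # Adjacent in the reference, allowing wrap-around at the origin.
            if diff == 1 or diff == genome_size - 1:
                conserved += 1
    return conserved / total if total else 0.0


# Toy example with a hypothetical 100-gene reference genome.
toy_fragments = [[10, 11, 12], [5, 4], [0, 99], [7, 30]]
fraction = synteny_fraction(toy_fragments, 100)
```

In this toy input, four of the five neighboring pairs are adjacent in the reference, and the pair (7, 30) is counted as a rearrangement. A real analysis would also need to handle gene orientation, missing hits, and paralogs, none of which this sketch addresses.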
Conclusion: These findings suggest that natural selection has conserved many core features of SAR11 genomes across broad oceanic scales, but that significant variation is associated with four hypervariable genome regions. The data also led to the hypothesis that some gene insertions and deletions might be polymorphisms, similar to allelic polymorphisms.