Review
Virus Evol. 2016 Aug 3;2(2):vew022.
doi: 10.1093/ve/vew022. eCollection 2016 Jul.

Challenges in the Analysis of Viral Metagenomes

Free PMC article

Rebecca Rose et al. Virus Evol.

Abstract

Genome sequencing technologies continue to develop at a remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling that serves researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress: fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes, demanding the use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and, finally, the relatively small number of documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.

Keywords: assembly; classification; epidemic; metagenomics; next-generation sequencing; surveillance.

Figures

Figure 1.
Two widely used methodologies in de novo assembly of short reads: de Bruijn graphs and overlap-layout-consensus (OLC). Reads are not represented explicitly within a de Bruijn graph; they are instead decomposed into distinct subsequence ‘words’ of length k, or k-mers, which can be linked together via overlapping k-mers to create an assembly graph. In OLC, a pairwise comparison of all reads is performed to identify reads with overlapping regions. These overlaps are used to construct a read graph. Next, in the layout step, overlapping reads are bundled into aligned contigs. Finally, the most likely nucleotide at each position is determined through consensus. This figure is simplified to demonstrate the theory for the assembly of single genomes; note that the process has additional complexities for the reconstruction of metagenomes.
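The two ideas in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular assembler: the reads are invented toy data, the function names (`kmers`, `de_bruijn_graph`, `walk_unbranched`, `overlap`) are hypothetical, and the graph uses one common formulation in which nodes are (k-1)-mers and each distinct k-mer forms an edge. Real assemblers also handle sequencing errors, reverse complements, and branching paths, all of which are ignored here.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every length-k substring (k-mer) of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Assumed formulation: nodes are (k-1)-mers, and each distinct k-mer
    contributes one edge from its prefix to its suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Follow edges from `start` while exactly one successor exists."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


def overlap(a, b, min_len=3):
    """OLC overlap step: longest suffix of `a` that is a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


# Toy reads (invented for illustration) sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched(graph, "ATG")
```

With these toy reads the graph is a single unbranched path, so walking it from `"ATG"` recovers the sequence the reads were drawn from. In a metagenome, shared k-mers between related genomes create branches, which is where the additional complexities mentioned in the caption arise.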
Figure 2.
Proposed discrete wavelet transform (DWT) signal processing approach for nucleotide sequence analysis. Sequences 1 and 2 are subsequences of the HIV-1 HXB2 genome (the reference genome for HIV), and sequence 3 is a subsequence of the Mycoplasma genitalium genome (all three sequences appear at the bottom of the figure). (A) illustrates the integer representations of the three sequences: sequence 1 is depicted as a black line, sequence 2 as a red line, and sequence 3 as a blue line. The sequences are mapped into numerical space with the integer representation method, enabling the application of transformation approaches. (B) illustrates the DWTs of the three sequences’ numerical representations at varying resolutions. The three sequences are each shown consecutively transformed into six reduced-resolution representations. The minor sequence mismatches between sequences 1 and 2 (indicated with green circles) can be easily detected at different transformation resolutions despite the reduction in information content from the transformation process. Similar nucleotide sequences give rise to similar DWTs and thus can be intuitively identified even at low resolution (level 6), where sequences are represented by a single numerical value. Depicted in (C) are the coefficient matrices obtained from each sequence’s DWT. Coefficient matrices can be used to approximately identify the mismatch positions between the two sequences. Sequences 1 and 2 differ only at sites 16–17 and 48–49. The exact location of minor differences can be detected at transformation level 4, where each sequence is compressed to four wavelets. Darker colored positions in between the matrices of sequences 1 and 2 indicate matching coefficients, and lighter colored positions indicate dissimilar coefficients.
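The approach in the Figure 2 caption can be illustrated with a small Python sketch. Several parts are assumptions rather than details taken from the source: the integer mapping (A=1, C=2, G=3, T=4), the choice of an unnormalized Haar wavelet (pairwise averages), the function names, and the two 64-nucleotide sequences, which are synthetic stand-ins and not the HXB2 subsequences shown in the figure. The synthetic pair is built to differ at sites 16–17 and 48–49, as in the caption.

```python
# Assumed integer representation; the source does not state the mapping.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq):
    """Map a nucleotide string to a list of integers."""
    return [ENCODING[base] for base in seq.upper()]


def haar_step(signal):
    """One unnormalized Haar level: pairwise averages and half-differences."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_levels(signal, levels):
    """Return the approximation coefficients at levels 1..levels.

    The signal length must be divisible by 2 ** levels.
    """
    out = []
    current = list(signal)
    for _ in range(levels):
        current, _detail = haar_step(current)
        out.append(current)
    return out


def differing(coeffs_a, coeffs_b):
    """Indices at which two coefficient lists disagree."""
    return [i for i, (x, y) in enumerate(zip(coeffs_a, coeffs_b)) if x != y]


# Synthetic 64-nt sequences (not the sequences from the figure).
seq_a = "ACGT" * 16
mutated = list(seq_a)
for index, base in {15: "A", 16: "C", 47: "A", 48: "C"}.items():
    mutated[index] = base  # 0-based indices for sites 16-17 and 48-49
seq_b = "".join(mutated)

levels_a = haar_levels(encode(seq_a), levels=6)
levels_b = haar_levels(encode(seq_b), levels=6)

# Level-1 coefficient i summarizes sites 2i+1 and 2i+2 (1-based).
level1_mismatches = differing(levels_a[0], levels_b[0])
```

For a 64-nucleotide input, level 4 leaves four coefficients and level 6 leaves one, matching the resolutions described in the caption. Comparing coefficients level by level narrows a mismatch to a progressively smaller window of sites.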
Figure 3.
Distinct viral species in the NCBI RefSeq releases from June 2003 – May 2015 (data from ftp://ncbi.nlm.nih.gov/refseq/release/release-statistics/viral.acc_taxid_growth.txt).
