, 11, 26

Gene Overlapping and Size Constraints in the Viral World



Nadav Brandes et al. Biol Direct.


Background: Viruses are the simplest replicating units, characterized by a limited number of coding genes and an exceptionally high rate of overlapping genes. We sought a unified evolutionary explanation that accounts for their genome sizes, gene overlapping and capsid properties.

Results: We performed an unbiased statistical analysis of ~100 families within ~400 genera that comprise the currently known viral world. We found that the volume utilization of capsids is often low and varies greatly among viral families. Furthermore, although viruses span three orders of magnitude in genome length, they almost never have more than 1500 overlapping nucleotides, or more than four significantly overlapping genes per virus.

Conclusions: Our findings undermine the generality of the compression theory, which emphasizes optimal packing and length dependency to explain overlapping genes and capsid size in viral genomes. Instead, we propose that gene novelty and evolutionary exploration offer better explanations for size constraints and gene overlapping in all viruses.

Reviewers: This article was reviewed by Arne Elofsson and David Kreil.

Keywords: Baltimore groups; Capsid; Icosahedral virion; Open reading frame; VIPERdb; Viral evolution; ViralZone.


Fig. 1
Overlapping rate is negatively correlated with genome length. a Illustration of overlapping scenarios. In this study, overlapping is restricted to two genes that overlap in their coding regions; other parts of the gene (e.g., 5′ and 3′ UTRs, or intergenic regions) are ignored. The same applies to the rare cases of viral genes with introns. Only pairs of genes that use different ORFs are considered overlapping genes. It follows that the first example gene (marked S1) overlaps only with Gene 1, while its “overlap” with Gene 2, which shares the same ORF (frame +2), is not counted (the latter is considered a trivial overlap). The second example gene (marked S2) demonstrates that a single gene can participate in multiple overlapping events. The third example gene (marked S3) is not involved in any (non-trivial) overlapping event. Light pink marks the overlapping segments only. For clarity, each ORF is identified by its own color. b A scatter plot demonstrating the negative correlation between genome length and overlapping rate in viral families. Both axes are in log scale. Thirteen families without any overlapping were filtered out to allow the use of log scale (as was done in the original work by Belshaw et al. [1], which is replicated here), leaving 80 families out of the complete data set of 93. The families are represented as ellipses, whose width and height correspond to the standard deviation of the genera within them (see Methods). The ellipses are colored by the partition of the families into viral replication groups (see Background). Spearman’s rank correlation: ρ = −0.59, p-value = 6.97 × 10^-9
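The overlap definition in panel a can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Gene` class, the function names, the half-open coordinates and the frame assigned to Gene 1 are all hypothetical, and a signed frame number stands in for the caption's "different ORFs" criterion. Two genes count as overlapping only if their coding intervals intersect and their frames differ.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    name: str
    start: int  # coding start, 0-based inclusive (hypothetical convention)
    end: int    # coding end, exclusive
    frame: int  # reading frame label, e.g. +1, +2, +3, -1, -2, -3


def nontrivial_overlaps(genes):
    """Return (name_a, name_b, start, end) for pairs overlapping in different frames."""
    result = []
    for a, b in combinations(genes, 2):
        lo, hi = max(a.start, b.start), min(a.end, b.end)
        # Same-frame overlaps are "trivial" and are skipped.
        if lo < hi and a.frame != b.frame:
            result.append((a.name, b.name, lo, hi))
    return result


def overlapping_nucleotides(genes):
    """Count positions covered by at least one non-trivial overlap."""
    covered = set()
    for _, _, lo, hi in nontrivial_overlaps(genes):
        covered.update(range(lo, hi))
    return len(covered)


# Toy coordinates, invented for illustration only.
genes = [
    Gene("Gene 1", 0, 300, +1),
    Gene("Gene 2", 400, 900, +2),
    Gene("S1", 250, 500, +2),    # meets Gene 1 (different frame) and Gene 2 (same frame)
    Gene("S3", 1000, 1200, +3),  # overlaps nothing
]
pairs = nontrivial_overlaps(genes)
total = overlapping_nucleotides(genes)
print(pairs, total)
```

In this toy layout S1 is reported as overlapping only Gene 1, mirroring the S1 example in the caption. Its same-frame contact with Gene 2 is ignored.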
Fig. 2
Overlapping amount is strictly bounded. a A scatter plot showing the absolute number of overlapping nucleotides against the genome lengths of all viral families. Only the X-axis is in log scale. Across the entire spectrum of genome lengths, viral genomes have a bounded number of nucleotides involved in overlapping. Three outlying families were filtered out (Nimaviridae, Phycodnaviridae and Iridoviridae, with 85,155/305,110, 30,798/357,847 and 7956/144,698 overlapping/total nucleotides, respectively), leaving 90 families shown. Spearman’s rank correlation is weak (ρ = 0.26, p-value = 0.015). The dashed lines mark thresholds (750, 1500 and 3000 nt) that demonstrate the bounded nature of the overlapping amount. Note that most viral families fall below these lines. b Of the complete data set of 352 genera, most (273, 329 and 346) have a total number of overlapping nucleotides below the chosen thresholds (750, 1500 and 3000 nt, respectively); 85 genera (24 %) have no overlapping at all. Although the selection of thresholds is somewhat arbitrary, a saturation point is reached at around 1500 nt
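The saturation described in panel b can be checked directly from the counts given in the caption. The short sketch below only reworks those reported numbers (352 genera; 273, 329 and 346 below the three thresholds; 85 with no overlap); the variable names are illustrative.

```python
total_genera = 352
below = {750: 273, 1500: 329, 3000: 346}  # threshold (nt) -> genera below it
no_overlap = 85

# Share of genera under each threshold, in percent.
shares = {t: round(100 * n / total_genera, 1) for t, n in below.items()}

# Genera gained each time the threshold is doubled.
gained_750_to_1500 = below[1500] - below[750]
gained_1500_to_3000 = below[3000] - below[1500]

print(shares)
print(round(100 * no_overlap / total_genera, 1))
print(gained_750_to_1500, gained_1500_to_3000)
```

Doubling the threshold from 750 to 1500 nt adds 56 genera, while doubling it again to 3000 nt adds only 17, which is the sense in which the curve saturates at around 1500 nt.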
Fig. 3
The number of significantly overlapping genes is bounded. a A scatter plot of the number of significantly overlapping genes (SOGs) against genome length, shown for 91 of the 93 viral families. Two outlying families were filtered out (Nimaviridae and Phycodnaviridae, with 141 of 532 and 50 of 505 significantly overlapping genes, respectively). Only the X-axis is in log scale. Spearman’s rank correlation is not significant (ρ = −0.08, p-value = 0.43). Most families have fewer than 4 significantly overlapping genes (dashed line), which corresponds to fewer than 2 gene pairs. b A scatter plot of the number of all overlapping genes, with no threshold applied, against genome length. Only the X-axis is in log scale. Two outlying families were filtered out (Nimaviridae and Phycodnaviridae, with 489 of 532 and 283 of 505 overlapping genes, respectively), leaving 91 families shown. Spearman's rank correlation: ρ = 0.55, p-value = 1.25 × 10^-8
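The figure captions report Spearman's rank correlation throughout. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: both variables are converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. The function names and the toy values are hypothetical and are not taken from the study's data; p-values are not computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Invented toy values: gene counts rise monotonically with genome length.
genome_lengths = [2_000, 10_000, 50_000, 150_000, 300_000]
overlapping_gene_counts = [1, 2, 4, 9, 20]
rho = spearman_rho(genome_lengths, overlapping_gene_counts)
print(rho)
```

Because the statistic depends only on rank order, it is unaffected by whether the axes are plotted in linear or log scale.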
Fig. 4
Overlapping amount and genome length are not associated with virion shape. a The same analysis as in Fig. 2a, with a different color scheme that highlights the partition between icosahedral and non-icosahedral viruses. Both classes are distributed throughout the plot. b A quantitative summary of the 90 families in the scatter plot (37 icosahedral and 53 non-icosahedral), showing the overall statistics of the two viral classes at family resolution. The two classes show similar values, in terms of both average and standard deviation
Fig. 5
Capsid volume usage is often low and varies significantly among viral families. a A scatter plot of volume usage (in %) against genome length. Only the X-axis is in log scale. The ellipses were created by first calculating the volume usage percentage for each genus separately, and then drawing the families by the distributions of these values. The analysis covers all icosahedral viruses that are associated with detailed 3D information. There are 24 such icosahedral families: 1 – Partitiviridae, 2 – Tymoviridae, 3 – Dicistroviridae, 4 – Rudiviridae, 5 – Bromoviridae, 6 – Togaviridae, 7 – Tectiviridae, 8 – Reoviridae, 9 – Papillomaviridae, 10 – Chrysoviridae, 11 – Circoviridae, 12 – Phycodnaviridae, 13 – Tombusviridae, 14 – Birnaviridae, 15 – Cystoviridae, 16 – Caliciviridae, 17 – Hepadnaviridae, 18 – Totiviridae, 19 – Leviviridae, 20 – Nodaviridae, 21 – Adenoviridae, 22 – Flaviviridae, 23 – Polyomaviridae, 24 – Picornaviridae. Spearman's rank correlation is not significant: ρ = −0.17, p-value = 0.42. b An arbitrary sample of 10 families presented in (a), demonstrating the proportions of their capsid and genome sizes, from which the volume usage is derived. A single genus was chosen to represent each family, illustrating its capsid (with surface images from VIPERdb) and genome size (a bar proportional to its length that also displays the number of strands, in the color of the relevant viral group). The radii of the capsid images are proportional to their outer radius (although it is the inner radius that determines the volume usage; both are given). Additional structural details (number of capsid subunits and T number) are also shown. The representative genus of each family was chosen by a uniform rule: the one with the largest inner radius. The same rule was applied to the displayed VIPERdb record


    1. Belshaw R, Gardner A, Rambaut A, Pybus OG. Pacing a small cage: mutation and RNA viruses. Trends Ecol Evol. 2008;23(4):188–93. doi: 10.1016/j.tree.2007.11.010.
    2. Sabath N, Wagner A, Karlin D. Evolution of viral proteins originated de novo by overprinting. Mol Biol Evol. 2012;29(12):3767–80. doi: 10.1093/molbev/mss179.
    3. Novella IS, Presloid JB, Taylor RT. RNA replication errors and the evolution of virus pathogenicity and virulence. Curr Opin Virol. 2014;9:143–7. doi: 10.1016/j.coviro.2014.09.017.
    4. Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 2008;9(4):267–76. doi: 10.1038/nrg2323.
    5. Holland J, Spindler K, Horodyski F, Grabau E, Nichol S, VandePol S. Rapid evolution of RNA genomes. Science. 1982;215(4540):1577–85. doi: 10.1126/science.7041255.
