A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins

Comput Chem. 2000 Jan;24(1):71-94. doi: 10.1016/s0097-8485(99)00048-0.

Abstract

Different local regions of natural amino acid or nucleotide sequences show remarkable heterogeneity in residue composition, reflecting diversity in evolutionary history and physicochemical constraints. Compositional complexity measures are helpful for describing and understanding this variegation. Motivated by some open problems in comparative genomics and protein folding, we have developed a new 'global' compositional complexity measure, G1, which overcomes a crucial limitation of earlier methods. The 'local' measures used in previous research resemble entropy functions and are inherently dependent on an underlying probability distribution. Local measures cannot rigorously compare complexity across sequences of substantially different size, because real sequences show very irregular heterogeneity and do not have the necessary ergodicity in scaling and asymptotic properties. G1 is a member of a new class of scale-independent, distribution-independent complexity functions. For a sequence S of length L on an N-letter alphabet, G1 is derived from ratios in the integer partition lattice, P{L,N}, of L with N parts, where the elements of P{L,N} are the state vectors of S, (n1, n2, ..., nN), ranked by an order principle. We present theorems and proofs relating to the metric properties of G1 and its relationship to other state-vector-dependent compositional complexity functions, together with a fully efficient O(L) algorithm to compute G1. The distributions of G1 were calculated for the entire sets of translated proteins encoded by extensively sequenced genomes. The results establish the existence of a clear evolutionary principle, common to bacteria, archaea and eukaryotes, that the proteins encoded by more extreme AT-rich and GC-rich genomes have generally lower compositional complexity than those of more typical organisms.
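As an illustration of the objects named in the abstract, the sketch below computes a state vector (n1, n2, ..., nN) for a sequence in a single O(L) pass, along with a Shannon entropy of the kind the 'local' measures are said to resemble. It does not compute G1, whose definition is not given in the abstract. The function names `state_vector` and `shannon_entropy` are hypothetical, and sorting the counts in non-increasing order and padding unused letters with zeros are assumptions made here so that the vector reads as a partition of L.

```python
from collections import Counter
import math


def state_vector(seq, alphabet):
    """Return the residue counts of seq over alphabet, largest first.

    Assumption: the counts are sorted in non-increasing order and letters
    absent from seq contribute 0, so the result sums to len(seq).
    """
    counts = Counter(seq)  # one pass over the sequence, O(L)
    return tuple(sorted((counts.get(letter, 0) for letter in alphabet),
                        reverse=True))


def shannon_entropy(vector):
    """Shannon entropy in bits of the composition given by a state vector.

    Shown only as an example of an entropy-like 'local' measure; this is
    not the G1 measure described in the abstract.
    """
    total = sum(vector)
    return -sum((n / total) * math.log2(n / total) for n in vector if n > 0)


if __name__ == "__main__":
    dna = "ACGT"
    for s in ("AACGT", "AAAAAAAT", "ACGTACGT"):
        v = state_vector(s, dna)
        print(s, v, round(shannon_entropy(v), 3))
```

Because the entropy depends only on the residue frequencies, "ACGT" and "ACGTACGT" receive the same value despite having different state vectors, (1, 1, 1, 1) and (2, 2, 2, 2).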

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Computer Simulation
  • DNA / chemistry*
  • Evolution, Molecular
  • Genome*
  • Proteins / chemistry

Substances

  • Proteins
  • DNA