Distinctive sequence features in protein coding genic non-coding, and intergenic human DNA

J Mol Biol. 1995 Oct 13;253(1):51-60. doi: 10.1006/jmbi.1995.0535.

Abstract

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.
  • Research Support, U.S. Gov't, P.H.S.

MeSH terms

  • Algorithms
  • Base Composition
  • Base Sequence / genetics*
  • DNA / genetics*
  • Databases, Factual
  • Discriminant Analysis
  • Genes / genetics*
  • Humans
  • Open Reading Frames / genetics
  • Proteins / genetics

Substances

  • Proteins
  • DNA