CSAM: using clustering-hashing-signal anchoring method to explore human novel genes

J Comput Biol. 2006 Dec;13(10):1775-89. doi: 10.1089/cmb.2006.13.1775.

Abstract

The expression of genes in mammalian cells can be constitutive, transient, or inducible. Transcripts of transient and inducible genes are difficult to discover using the EST approach. Transiently expressed genes, however, are crucial to embryonic development and to the pathogenesis of disease because they determine the outcome of disease. We aimed to identify novel exons in transiently expressed genes using a new bioinformatics approach, which we believe will facilitate verification of novel transcripts in developing embryos or pathogen-induced cells. First, the proposed method uses a general gene predictor that must be able to produce all possible optimal or suboptimal candidate exons in the human genome. After signal processing, an anchoring procedure rapidly transforms the candidate sequences into numeric hashing signals and groups them into many clusters. Meanwhile, an entropy-based theorem is used to remove repeat matches, which account for most erroneous matches. Finally, the method reports the exons identified by cross-species alignment with other genomic or EST sequences. Our results indicated that 3,223 filtered target exons were potential novel exons. The theoretical threshold for filtering repeat matches, determined using the computational method, had 95.3% sensitivity and 81.8% specificity. The inferential threshold was close to the experimental threshold, which is a practical expected value that takes both sensitivity and specificity into account. Our results therefore demonstrate the feasibility of the method. Combining the anchoring method and its embedded entropy-based filter with an inherently unreliable gene predictor can yield a small set of potentially novel exons, because the combination avoids many drawbacks of some traditional gene predictors.
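The abstract does not state the entropy-based theorem or its threshold, so the following is only a minimal sketch of the general idea of an entropy filter for repeat-like sequences, not the authors' method. It assumes plain Shannon entropy over the k-mer composition of each candidate sequence; the function names, the `k` parameter, and the threshold value of 1.5 bits are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Return the Shannon entropy (in bits) of the k-mer distribution of seq.

    Low values indicate low-complexity, repeat-like sequence.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_repeat_like(candidates, threshold: float = 1.5, k: int = 1):
    """Keep candidates whose entropy is at or above the threshold.

    The threshold is a hypothetical value, not the one derived in the paper.
    """
    return [s for s in candidates if shannon_entropy(s, k) >= threshold]


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    toy = ["AAAAAAAA", "ATATATAT", "ACGTTGCA"]
    for s in toy:
        print(s, round(shannon_entropy(s), 3))
    print(filter_repeat_like(toy))
```

In this sketch a homopolymer scores 0 bits and a sequence with equal use of all four nucleotides scores 2 bits, so a cut-off between those extremes discards the simplest repeats. Choosing that cut-off is where the trade-off between sensitivity and specificity arises.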

MeSH terms

  • Algorithms
  • Cluster Analysis
  • Exons
  • Expressed Sequence Tags
  • Feasibility Studies
  • Gene Expression Profiling
  • Genome, Human*
  • Humans
  • Sequence Analysis