Exploiting hidden information interleaved in the redundancy of the genetic code without prior knowledge

Bioinformatics. 2015 Apr 15;31(8):1161-8. doi: 10.1093/bioinformatics/btu797. Epub 2014 Nov 29.

Abstract

Motivation: Dozens of studies in recent years have demonstrated that codon usage encodes information related to all stages of gene expression regulation. When relevant, high-quality, large-scale gene expression data are available, these signals can be statistically inferred and modelled, enabling the analysis and engineering of gene expression. However, when such data are not available, it is impossible to infer and validate such models.

Results: In this study, we suggest Chimera, an unsupervised, computationally efficient approach for exploiting hidden high-dimensional information related to the way gene expression is encoded in the open reading frame (ORF), based solely on the genome of the analysed organism. One version of the approach, named Chimera Average Repetitive Substring (ChimeraARS), estimates the adaptability of an ORF to the intracellular gene expression machinery of a genome (host) by computing its tendency to include long substrings that appear in the host's coding sequences. The second version, named ChimeraMap, engineers the codons of a protein so that it includes long substrings of codons that appear in the host coding sequences, improving its adaptation to a new host's gene expression machinery. We demonstrate the applicability of the new approach for analysing and engineering heterologous genes and for analysing endogenous genes. Specifically, focusing on Escherichia coli, we show that it can exploit information that cannot be detected by conventional approaches (e.g. the Codon Adaptation Index, CAI), which consider only single-codon distributions; for example, we report correlations of up to 0.67 between the ChimeraARS measure and heterologous gene expression in cases where the CAI yielded no correlation.
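The two variants described above can be illustrated with a minimal Python sketch. It is one plausible reading of the abstract, not the authors' implementation: the function names, the nucleotide-level alphabet used for the ARS score, the greedy left-to-right strategy used for the mapping, and the toy sequences are all assumptions made for illustration. The brute-force substring search here is also far slower than whatever data structure an efficient implementation would use.

```python
def longest_match_length(query, start, references):
    """Length of the longest substring of `query` beginning at `start`
    that occurs somewhere in at least one reference sequence."""
    best = 0
    length = 1
    while start + length <= len(query):
        piece = query[start:start + length]
        if any(piece in ref for ref in references):
            best = length
            length += 1
        else:
            # If this piece is absent, every longer extension is absent too.
            break
    return best


def average_repetitive_substring(query, references):
    """Hypothetical ARS-style score: the mean, over all start positions,
    of the longest reference-matching substring length."""
    if not query:
        return 0.0
    total = sum(longest_match_length(query, i, references)
                for i in range(len(query)))
    return total / len(query)


def chimera_map(target_protein, host):
    """Hypothetical mapping step: cover the target protein left to right
    with the longest amino-acid substrings found in host proteins, and
    copy the host codons that encode each matched block.

    `host` is a list of (protein, cds) pairs, where cds is the coding
    DNA of the protein (3 nucleotides per amino acid, no stop codon).
    """
    pieces = []
    pos = 0
    while pos < len(target_protein):
        best_len, best_dna = 0, ""
        for protein, cds in host:
            # Only try lengths that would improve on the current best.
            length = best_len + 1
            while pos + length <= len(target_protein):
                idx = protein.find(target_protein[pos:pos + length])
                if idx < 0:
                    break
                best_len = length
                best_dna = cds[3 * idx:3 * (idx + length)]
                length += 1
        if best_len == 0:
            raise ValueError(
                f"residue {target_protein[pos]!r} not found in any host protein")
        pieces.append(best_dna)
        pos += best_len
    return "".join(pieces)


# Toy data, invented for this example only.
ars = average_repetitive_substring("ATGAAA", ["ATGCCC", "GAAATT"])
toy_host = [("MKT", "ATGAAAACC"), ("TLV", "ACGCTGGTT")]
engineered = chimera_map("MKTLV", toy_host)
print(ars, engineered)
```

Under these assumptions, a higher score means the ORF shares longer stretches with the reference coding sequences, and the mapped sequence is a concatenation of codon blocks copied from the host, so it preserves codon context beyond single-codon frequencies.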

Availability and implementation: For non-commercial purposes, the code of the Chimera approach can be downloaded from http://www.cs.tau.ac.il/~tamirtul/Chimera/download.htm.

Contact: tamirtul@post.tau.ac.il

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Codon / genetics*
  • Computational Biology / methods
  • Escherichia coli / genetics*
  • Escherichia coli Proteins / genetics*
  • Escherichia coli Proteins / metabolism
  • Gene Expression Regulation, Bacterial*
  • Genome, Bacterial*
  • Open Reading Frames / genetics*
  • Protein Biosynthesis

Substances

  • Codon
  • Escherichia coli Proteins