Motivation: Whole genomes submitted to GenBank contain valuable information about gene function and about the sequences upstream of genes, and whole-cell expression data provide valuable information on gene regulation. To turn these large amounts of data into a biological understanding of the regulation of gene expression, new automatic methods for pattern finding are needed.
Results: Two word-analysis algorithms for automatic discovery of regulatory sequence elements have been developed. We show that sequence patterns correlated with whole-cell expression data can be found using Kolmogorov-Smirnov tests on the raw data, thereby eliminating the need to cluster co-regulated genes. Regulatory elements have also been identified by systematically calculating the significance of correlations between words found in the functional annotation of genes and DNA words occurring in their promoter regions. Application of these algorithms to the Saccharomyces cerevisiae genome and to publicly available DNA array data sets revealed a highly conserved 9-mer occurring in the upstream regions of genes coding for proteasomal subunits. Several other putative and known regulatory elements were also found.
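As an illustration of the first approach, the following minimal Python sketch compares the expression values of genes whose upstream region contains a given DNA word against those of genes lacking it, using a two-sample Kolmogorov-Smirnov test. It is not the implementation used in this work: the function names, the gene identifiers, the sequences, the expression values and the example word are all hypothetical, and the p-value uses the plain asymptotic Kolmogorov distribution, which is only a rough guide for small samples.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n_a, n_b, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    lam = math.sqrt(n_a * n_b / (n_a + n_b)) * d
    if lam < 0.3:
        # The alternating series converges poorly here; the true value is ~1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))


def score_word(word, upstream, expression):
    """Test whether genes with `word` upstream have shifted expression.

    Works on the raw expression values, so no clustering step is needed.
    Returns (D, p) or None if one of the two groups is empty.
    """
    with_word = [expression[g] for g, seq in upstream.items() if word in seq]
    without = [expression[g] for g, seq in upstream.items() if word not in seq]
    if not with_word or not without:
        return None
    d = ks_statistic(with_word, without)
    return d, ks_pvalue(d, len(with_word), len(without))


# Hypothetical toy data: made-up gene names, upstream sequences and log ratios.
upstream = {
    "geneA": "TTACGTACGTATT",
    "geneB": "GGACGTACGTAGG",
    "geneC": "TTTTTTTTTTTTT",
    "geneD": "CCCCCCGGGGGGG",
    "geneE": "ATATATATATATA",
}
expression = {"geneA": 2.1, "geneB": 1.8, "geneC": -0.2, "geneD": 0.1, "geneE": 0.0}

# "ACGTACGTA" is an arbitrary 9-mer chosen only for this example.
result = score_word("ACGTACGTA", upstream, expression)
print(result)
```

In a genome-wide scan, a function like `score_word` would be applied to every word of a chosen length, and the resulting p-values would need correction for the number of words tested.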
Availability: Upon request.