Systematic detection of statistically overrepresented DNA motif association rules

Genome Inform. 2006;17(1):124-33.

Abstract

DNA motifs, or cis-elements, are short nucleotide sequence patterns recognized by various transcription factors (TFs). In promoters, these TFs bind in a complex combinatorial manner to regulate the expression of a downstream gene. The combinatorial space is frequently large and difficult to manage, since vertebrates have thousands of transcription factors and more than 20,000 genes. We introduce a computer program called CAYCE (Combinatorial AnalYsis of Cis-Elements) that systematically detects statistically overrepresented DNA motif association rules independent of microarray information. CAYCE is an adaptation of the apriori algorithm traditionally used for association rule mining, but offers three significant advancements: (1) it analyzes multiple occurrences of an item, corresponding to multiple TF binding sites; (2) it compares results with a biologically relevant background; and (3) it provides p-values for straightforward statistical interpretation. CAYCE can be easily applied to any item-set data where the investigator is also interested in multiple occurrences of a single item, and/or overrepresentation of association rules compared with a background. Applying CAYCE to human promoters in 1% of the human genome, we discover that motif clusters containing five repetitions of SP1 are the most statistically significant.
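The three ideas named in the abstract (repeated items, a background comparison, and a p-value) can be illustrated with a short Python sketch. This is not the CAYCE implementation, whose internals the abstract does not describe. It assumes each promoter is represented as a count of motif hits, that candidates are grown level by level in the apriori manner, and that overrepresentation is scored with a one-sided hypergeometric test; the choice of test, the function names (`mine`, `support`, `hypergeom_sf`), the parameters (`min_support`, `max_size`), and the toy promoters are all hypothetical.

```python
from collections import Counter
from math import comb


def support(promoters, needed):
    """Number of promoters holding at least the required count of every motif."""
    return sum(all(p[m] >= k for m, k in needed.items()) for p in promoters)


def hypergeom_sf(k, K, n, N):
    """P(X >= k) when n promoters are drawn from N, of which K contain the motif set."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


def mine(foreground, background, min_support=2, max_size=3):
    """Level-wise (apriori-style) search over motif multisets.

    A candidate is a sorted tuple that may repeat a motif, e.g. ("SP1", "SP1").
    Candidates are extended only from multisets that met min_support, which is
    safe because support can never increase when a motif is added.
    """
    motifs = sorted({m for p in foreground for m in p})
    n_fg, n_all = len(foreground), len(foreground) + len(background)
    results, frontier = [], [()]
    for _ in range(max_size):
        next_frontier = []
        for base in frontier:
            for m in motifs:
                if base and m < base[-1]:
                    continue  # keep tuples sorted so each multiset is built once
                cand = base + (m,)
                needed = Counter(cand)
                fg = support(foreground, needed)
                if fg < min_support:
                    continue  # apriori pruning
                bg = support(background, needed)
                p_value = hypergeom_sf(fg, fg + bg, n_fg, n_all)
                results.append((cand, fg, bg, p_value))
                next_frontier.append(cand)
        frontier = next_frontier
    return sorted(results, key=lambda r: r[3])


# Toy data (made up for illustration): motif hit counts per promoter.
foreground = [Counter(SP1=2, GABP=1), Counter(SP1=2), Counter(SP1=3, NRF1=1)]
background = [Counter(SP1=1), Counter(NRF1=1), Counter(GABP=1), Counter()]

for motif_set, fg, bg, p_value in mine(foreground, background, max_size=2):
    print(motif_set, fg, bg, round(p_value, 4))
```

In this toy setting the repeated motif set ("SP1", "SP1") ranks ahead of the single ("SP1",) because no background promoter carries two SP1 hits, which is the kind of signal that counting repetitions is meant to expose. A real analysis would also need a correction for multiple testing, which the sketch omits.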

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Amino Acid Motifs / genetics
  • Binding Sites / genetics
  • Cell Line
  • Combinatorial Chemistry Techniques / statistics & numerical data*
  • DNA / genetics*
  • DNA / metabolism
  • GA-Binding Protein Transcription Factor / genetics
  • GA-Binding Protein Transcription Factor / metabolism
  • Humans
  • NF-E2-Related Factor 1 / genetics
  • NF-E2-Related Factor 1 / metabolism
  • Promoter Regions, Genetic*
  • Random Allocation
  • Sequence Analysis, DNA*
  • Sp1 Transcription Factor / genetics
  • Sp1 Transcription Factor / metabolism
  • Transcription Factors / chemistry
  • Transcription Factors / genetics*
  • Transcription Factors / metabolism

Substances

  • GA-Binding Protein Transcription Factor
  • NF-E2-Related Factor 1
  • Sp1 Transcription Factor
  • Transcription Factors
  • DNA