Motivation: 'Phylogenetic footprinting' is a widely applied approach to identify regulatory regions and potential transcription factor binding sites (TFBSs) using alignments of non-coding orthologous regions from two or more organisms. A systematic evaluation of its validity and usability based on known TFBSs is needed to use phylogenetic footprinting most effectively in the identification of unknown TFBSs.
Results: In this paper we use 2678 human, mouse and rat TFBSs from the TRANSFAC database for this evaluation. To ensure the retrieval of correct orthologous sequences, we combine gene annotation and sequence homology searches. Demanding a sequence identity of at least 65% is most effective in discriminating TFBSs from non-functional sequence parts, while different alignment algorithms only have a minor influence on TFBS identification by human-rodent comparisons. With this threshold approximately 72% of the known TFBSs are found conserved, a number which varies significantly between different transcription factors and also depends on the function of the regulated gene. TFBSs for certain transcription factors do not require strict sequence conservation but instead may show a high pattern conservation, limiting somewhat the validity of purely sequence-based phylogenetic footprinting.