Identification of High-Impact cis-Regulatory Mutations Using Transcription Factor Specific Random Forest Models

PLoS Comput Biol. 2015 Nov 12;11(11):e1004590. doi: 10.1371/journal.pcbi.1004590. eCollection 2015 Nov.


Cancer genomes contain vast amounts of somatic mutations, many of which are passenger mutations not involved in oncogenesis. Whereas driver mutations in protein-coding genes can be distinguished from passenger mutations based on their recurrence, non-coding mutations are usually not recurrent at the same position. Therefore, it is still unclear how to identify cis-regulatory driver mutations, particularly when chromatin data from the same patient is not available, thus relying only on sequence and expression information. Here we use machine-learning methods to predict functional regulatory regions using sequence information alone, and compare the predicted activity of the mutated region with the reference sequence. This way we define the Predicted Regulatory Impact of a Mutation in an Enhancer (PRIME). We find that the recently identified driver mutation in the TAL1 enhancer has a high PRIME score, representing a "gain-of-target" for MYB, whereas the highly recurrent TERT promoter mutation has a surprisingly low PRIME score. We trained Random Forest models for 45 cancer-related transcription factors, and used these to score variations in the HeLa genome and somatic mutations across more than five hundred cancer genomes. Each model predicts only a small fraction of non-coding mutations with a potential impact on the function of the encompassing regulatory region. Nevertheless, as these few candidate driver mutations are often linked to gains in chromatin activity and gene expression, they may contribute to the oncogenic program by altering the expression levels of specific oncogenes and tumor suppressor genes.
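The comparison of mutated and reference sequence described above can be illustrated with a minimal sketch. It assumes the impact score is simply the predicted activity of the mutated sequence minus that of the reference sequence; the abstract does not give the exact formula, so this is an assumption. The names `apply_substitution`, `delta_score`, and `toy_motif_score`, and the motif `AACGG`, are hypothetical. The toy scorer is a plain motif-match stand-in for a trained transcription factor specific Random Forest model, not the authors' method.

```python
from typing import Callable


def apply_substitution(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based index pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("pos lies outside the sequence")
    if len(alt) != 1:
        raise ValueError("only single-base substitutions are handled")
    return seq[:pos] + alt + seq[pos + 1:]


def delta_score(score_fn: Callable[[str], float],
                ref_seq: str, pos: int, alt: str) -> float:
    """Score of the mutated sequence minus score of the reference.

    Assumption: the impact is a simple difference of predicted
    activities. Positive values suggest a gain, negative values a loss.
    """
    mut_seq = apply_substitution(ref_seq, pos, alt)
    return score_fn(mut_seq) - score_fn(ref_seq)


def toy_motif_score(seq: str, motif: str = "AACGG") -> float:
    """Hypothetical stand-in for a trained model.

    Returns the best fraction of bases matching `motif` over all
    windows of the sequence (0.0 if the sequence is too short).
    """
    seq = seq.upper()
    k = len(motif)
    best = 0
    for i in range(len(seq) - k + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + k], motif))
        best = max(best, matches)
    return best / k


# Toy sequence, not real genomic data: a T>G change at index 6
# completes the hypothetical motif, so the difference is positive.
reference = "TTTAACTGTTT"
impact = delta_score(toy_motif_score, reference, pos=6, alt="G")
print(f"delta score: {impact:.2f}")
```

In practice `score_fn` would be replaced by the prediction function of a model trained for one transcription factor, so that each factor yields its own score for the same mutation.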

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Binding Sites / genetics
  • Computational Biology / methods
  • Genome
  • HeLa Cells
  • Humans
  • Machine Learning
  • Models, Statistical*
  • Mutation / genetics*
  • Neoplasms / genetics*
  • Regulatory Sequences, Nucleic Acid / genetics*
  • Transcription Factors / genetics*


Substances

  • Transcription Factors

Grant support

This work was funded by FWO [G.0791.14 to S.A.]; Special Research Fund (BOF) KU Leuven [PF/10/016 to S.A.]; Foundation Against Cancer [2012-F2 to S.A.]; an IWT PhD fellowship (to H.I.); and a post-doctoral fellowship from Kom op Tegen Kanker (Stand up to Cancer), the Flemish cancer society (to Z.K.A.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.