Benchmarking DNA binding affinity models using allele-specific transcription factor binding data

bioRxiv [Preprint]. 2023 Dec 15:2023.12.15.571887. doi: 10.1101/2023.12.15.571887.

Abstract

Transcription factors (TFs) bind to DNA in a highly sequence-specific manner. This specificity can manifest itself in vivo at heterozygous loci as a difference in TF occupancy between the two alleles. When applied on a genomic scale, functional genomic assays such as ChIP-seq typically lack the statistical power to detect allele-specific binding (ASB) at the level of individual variants. To address this, we propose a framework for benchmarking sequence-to-affinity models for TF binding in terms of their ability to predict allelic imbalances in ChIP-seq counts. We show that a likelihood function based on an over-dispersed binomial distribution can aggregate evidence for allelic preference across the genome without requiring statistical significance for individual variants. This allows us to systematically compare predictive performance when multiple binding models for the same TF are available. We introduce PyProBound, an easily extensible reimplementation of the ProBound biophysically interpretable machine learning framework. Configuring PyProBound to explicitly account for a confounding sequence-specific bias in DNA fragmentation rate yields improved TF binding models when training on ChIP-seq data. We also show how our likelihood function can be leveraged to perform de novo motif discovery on the raw allele-aware ChIP-seq counts.
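The idea of aggregating allelic evidence through an over-dispersed binomial likelihood can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the PyProBound API. It assumes a beta-binomial as the over-dispersed binomial, a logistic link from the model-predicted log affinity ratio to the expected reference-allele fraction, and hypothetical names and parameters (`asb_log_likelihood`, `scale`, `concentration`) with made-up toy counts.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed from log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, alpha, beta):
    """Log-probability of k successes in n trials under a beta-binomial."""
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def asb_log_likelihood(counts, delta_log_affinity, scale=1.0, concentration=20.0):
    """Sum the log-likelihood of allelic read counts over all variants.

    counts: list of (ref_reads, alt_reads) per heterozygous variant.
    delta_log_affinity: predicted log(K_ref / K_alt) per variant.
    scale, concentration: hypothetical parameters (assumptions, not from
    the source) for the logistic link and the amount of over-dispersion.
    """
    total = 0.0
    for (ref, alt), delta in zip(counts, delta_log_affinity):
        # Assumed link: expected reference-allele fraction from the affinity ratio.
        p = 1.0 / (1.0 + math.exp(-scale * delta))
        # Beta-binomial with mean p; smaller concentration means more dispersion.
        total += betabinom_logpmf(
            ref, ref + alt, p * concentration, (1.0 - p) * concentration
        )
    return total


# Toy data (invented for illustration): no single variant needs to be significant.
counts = [(12, 5), (3, 9), (7, 6), (15, 4)]
model_delta = [1.0, -1.2, 0.1, 1.5]   # predictions from a candidate binding model
null_delta = [0.0, 0.0, 0.0, 0.0]     # null model: no allelic preference

ll_model = asb_log_likelihood(counts, model_delta)
ll_null = asb_log_likelihood(counts, null_delta)
print("log-likelihood gain over null:", ll_model - ll_null)
```

Under these assumptions, two binding models for the same TF could be compared by their summed log-likelihood on the same set of variants, with the higher value indicating better agreement with the observed allelic imbalance.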

Keywords: CTCF; ChIP-seq; DNA binding specificity; Gene expression regulation; allele-specific binding; biophysically interpretable machine learning; motif discovery; non-coding genetic variation; statistical modeling; transcription factors.

Publication types

  • Preprint