Benchmarking DNA binding affinity models using allele-specific transcription factor binding data

Xiaoting Li; Lucas A N Melo; Harmen J Bussemaker

doi:10.1101/2023.12.15.571887

bioRxiv [Preprint]. 2023 Dec 15:2023.12.15.571887. doi: 10.1101/2023.12.15.571887.

Authors

Xiaoting Li¹, Lucas A N Melo¹, Harmen J Bussemaker¹,²

Affiliations

¹ Department of Biological Sciences, Columbia University, New York, NY 10027, USA.
² Department of Systems Biology, Columbia University, New York, NY 10032, USA.

Abstract

Transcription factors (TFs) bind to DNA in a highly sequence-specific manner. This specificity can manifest itself in vivo at heterozygous loci as a difference in TF occupancy between the two alleles. When applied on a genomic scale, functional genomic assays such as ChIP-seq typically lack the statistical power to detect allele-specific binding (ASB) at the level of individual variants. To address this, we propose a framework for benchmarking sequence-to-affinity models for TF binding in terms of their ability to predict allelic imbalances in ChIP-seq counts. We show that a likelihood function based on an over-dispersed binomial distribution can aggregate evidence for allelic preference across the genome without requiring statistical significance for individual variants. This allows us to systematically compare predictive performance when multiple binding models for the same TF are available. We introduce PyProBound, an easily extensible reimplementation of the ProBound biophysically interpretable machine learning framework. Configuring PyProBound to explicitly account for a confounding sequence-specific bias in DNA fragmentation rate yields improved TF binding models when training on ChIP-seq data. We also show how our likelihood function can be leveraged to perform de novo motif discovery on the raw allele-aware ChIP-seq counts.
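To make the aggregation idea concrete, the following is a minimal sketch of how an over-dispersed binomial likelihood might be summed over heterozygous variants. It is not the authors' implementation and does not use the PyProBound API. It assumes a beta-binomial distribution as the over-dispersed binomial, and a logistic link from a model's predicted log-affinity difference between alleles to the expected reference-allele read fraction. The function names and the `scale` and `concentration` parameters are hypothetical, chosen for illustration only.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_logpmf(k, n, p, concentration):
    """Log-probability of k reference reads out of n under a beta-binomial.

    The distribution has mean fraction p; smaller `concentration` means
    more over-dispersion relative to a plain binomial (assumed
    parametrization, not taken from the paper).
    """
    a = p * concentration
    b = (1.0 - p) * concentration
    log_choose = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def asb_log_likelihood(ref_counts, alt_counts, delta_log_affinity,
                       scale=1.0, concentration=20.0):
    """Sum the per-variant log-likelihood over all heterozygous variants.

    delta_log_affinity[i] is a model's predicted log(K_ref / K_alt) for
    variant i. A logistic link (an assumption) maps it to the expected
    reference-allele fraction. No variant needs to be individually
    significant; each one simply contributes its term to the total.
    """
    total = 0.0
    for r, a, d in zip(ref_counts, alt_counts, delta_log_affinity):
        p = 1.0 / (1.0 + math.exp(-scale * d))
        total += beta_binomial_logpmf(r, r + a, p, concentration)
    return total


# Toy data (made up for illustration): three variants with allelic counts.
ref = [12, 3, 9]
alt = [4, 10, 8]
model_a = [1.0, -1.0, 0.0]   # predictions matching the direction of imbalance
model_b = [-1.0, 1.0, 0.0]   # same magnitudes with the signs flipped

ll_a = asb_log_likelihood(ref, alt, model_a)
ll_b = asb_log_likelihood(ref, alt, model_b)
```

Under these assumptions, two binding models for the same TF can be ranked by comparing their total log-likelihoods on the same set of variants; here `ll_a` exceeds `ll_b` because `model_a` predicts the observed direction of imbalance. The sketch expects moderate values of `scale * d`, since extreme values would push `p` to 0 or 1.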

Keywords: CTCF; ChIP-seq; DNA binding specificity; Gene expression regulation; allele-specific binding; biophysically interpretable machine learning; motif discovery; non-coding genetic variation; statistical modeling; transcription factors.

Publication types

Preprint

Grants and funding

R01 MH106842/MH/NIMH NIH HHS/United States