Many recent studies have emphasized the importance of genetic variants and mutations in cancer and other complex human diseases. The overwhelming majority of these variants occur in non-coding portions of the genome, where they can have a functional impact by disrupting regulatory interactions between transcription factors (TFs) and DNA. Here, we present a method for assessing the impact of non-coding mutations on TF-DNA interactions, based on regression models of DNA-binding specificity trained on high-throughput in vitro data. We use ordinary least squares (OLS) to estimate the parameters of the binding model for each TF, and we show that our predictions of TF-binding changes due to DNA mutations correlate well with measured changes in gene expression. In addition, by leveraging distributional results associated with OLS estimation, for each predicted change in TF binding we also compute a normalized score (z-score) and a significance value (p-value) reflecting our confidence that the mutation affects TF binding. We use this approach to analyze a large set of pathogenic non-coding variants, and we show that these variants lead to significant differences in TF binding between alleles, compared to a control set of common variants. Thus, our results indicate that there is a strong regulatory component to the pathogenic non-coding variants identified thus far.
Keywords: TF-DNA binding; non-coding variants; regression models.