TGPred: efficient methods for predicting target genes of a transcription factor by integrating statistics, machine learning and optimization

Xuewei Cao; Ling Zhang; Md Khairul Islam; Mingxia Zhao; Cheng He; Kui Zhang; Sanzhen Liu; Qiuying Sha; Hairong Wei

doi:10.1093/nargab/lqad083

NAR Genom Bioinform. 2023 Sep 13;5(3):lqad083. doi: 10.1093/nargab/lqad083. eCollection 2023 Sep.

Authors

Xuewei Cao¹, Ling Zhang²˒³, Md Khairul Islam²˒³, Mingxia Zhao⁴, Cheng He⁴, Kui Zhang¹, Sanzhen Liu⁴, Qiuying Sha¹, Hairong Wei¹˒²˒³

Affiliations

¹ Department of Mathematical Sciences, Michigan Technological University, Houghton, MI 49931, USA.
² Computational Science and Engineering Program, Michigan Technological University, Houghton, MI 49931, USA.
³ College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA.
⁴ Department of Plant Pathology, Kansas State University, Manhattan, KS 66506, USA.

Abstract

Four statistical selection methods for inferring transcription factor (TF)-target gene (TG) pairs were developed by coupling a mean squared error (MSE) or Huber loss function with an elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining a Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we improved an accelerated proximal gradient descent (APGD) algorithm to optimize the parameter selection process, resulting in an algorithm that is equally effective but much faster than the commonly used convex optimization solver. Synthetic data generated in a general setting were used to test the four TF-TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings were used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF-TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF-TG identification methods fill a gap, namely the lack of a method for genome-wide TG prediction for a TF, and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.
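
To make the Huber-ENET idea concrete, the following is a minimal sketch of an accelerated proximal gradient (FISTA-style) solver for a Huber loss with an elastic net penalty. It is not the TGPred implementation: the function names, the penalty parameterization `lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)`, the fixed step size and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def huber_grad(X, y, beta, delta):
    """Gradient of the mean Huber loss with respect to beta."""
    residual = y - X @ beta
    # The derivative of the Huber loss in the residual is the residual
    # clipped to [-delta, delta].
    psi = np.clip(residual, -delta, delta)
    return -(X.T @ psi) / len(y)


def huber_enet_objective(X, y, beta, lam, alpha, delta):
    """Mean Huber loss plus the (assumed) elastic net penalty."""
    r = np.abs(y - X @ beta)
    loss = np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)).mean()
    penalty = lam * (alpha * np.abs(beta).sum()
                     + 0.5 * (1.0 - alpha) * (beta**2).sum())
    return loss + penalty


def prox_enet(v, step, lam, alpha):
    """Proximal operator of step * elastic net penalty.

    Soft-thresholding handles the L1 part; the division handles the
    squared L2 part.
    """
    soft = np.sign(v) * np.maximum(np.abs(v) - step * lam * alpha, 0.0)
    return soft / (1.0 + step * lam * (1.0 - alpha))


def apgd_huber_enet(X, y, lam=0.1, alpha=0.5, delta=1.0,
                    n_iter=500, tol=1e-8):
    """Accelerated proximal gradient descent with Nesterov momentum."""
    n, p = X.shape
    # The Huber loss has curvature at most 1, so the smooth part has a
    # Lipschitz gradient with constant ||X||_2^2 / n; use step = 1 / L.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    z = beta.copy()  # extrapolated point
    t = 1.0
    for _ in range(n_iter):
        beta_new = prox_enet(z - step * huber_grad(X, y, z, delta),
                             step, lam, alpha)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta, t = beta_new, t_new
        if converged:
            break
    return beta


# Hypothetical synthetic example: 3 informative predictors out of 20,
# with a few gross outliers in the response.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(200)
y[:10] += 10.0  # outliers that the Huber loss down-weights

beta_hat = apgd_huber_enet(X, y, lam=0.1, alpha=0.5, delta=1.0)
print(np.round(beta_hat, 2))
```

Setting `alpha=1.0` in this sketch gives a Lasso-type penalty, and replacing the clipped residual in `huber_grad` with the raw residual gives the MSE variant, which mirrors the four loss and penalty pairings described above. The network-based penalty used by Huber-Net and MSE-Net is not covered here.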