OutPredict: multiple datasets can improve prediction of expression and inference of causality

Jacopo Cirrone; Matthew D Brooks; Richard Bonneau; Gloria M Coruzzi; Dennis E Shasha

doi:10.1038/s41598-020-63347-3

Sci Rep. 2020 Apr 22;10(1):6804. doi: 10.1038/s41598-020-63347-3.

Authors

Jacopo Cirrone¹, Matthew D Brooks², Richard Bonneau³,²,⁴, Gloria M Coruzzi², Dennis E Shasha³

Affiliations

¹ Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, 10012, USA. cirrone@courant.nyu.edu.
² Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, 10003, USA.
³ Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, 10012, USA.
⁴ Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, 10010, USA.

Abstract

The ability to accurately predict the causal relationships from transcription factors to genes would greatly enhance our understanding of transcriptional dynamics. This could lead to applications in which one or more transcription factors could be manipulated to effect a change in genes leading to the enhancement of some desired trait. Here we present a method called OutPredict that constructs a model for each gene based on time series (and other) data and that predicts each gene's expression at a previously unseen subsequent time point. The model also infers causal relationships based on the most important transcription factors for each gene model, some of which have been validated by previous physical experiments. The method benefits from known network edges and steady-state data to enhance predictive accuracy. Our results across B. subtilis, Arabidopsis, E. coli, Drosophila and the DREAM4 simulated in silico dataset show improved predictive accuracy ranging from 40% to 60% over other state-of-the-art methods. We find that gene expression models can benefit from the addition of steady-state data to predict expression values of time series. Finally, we validate, based on limited available data, that the influential edges we infer correspond to known relationships significantly more than expected by chance or by state-of-the-art methods.
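
The general idea of a per-gene model that predicts expression at the next time point and ranks transcription factors by influence can be illustrated with a minimal sketch. This is not the OutPredict implementation: the abstract does not name the learning algorithm, so the sketch substitutes ridge-regularized linear regression, and the function names (`fit_gene_models`, `predict_next`, `rank_regulators`) and the `ridge` parameter are hypothetical. It also omits the known network edges and steady-state data that the method uses.

```python
import numpy as np

def fit_gene_models(expr, tf_idx, ridge=1e-3):
    """Fit one linear model per gene (an assumed stand-in learner).

    expr   : array of shape (T, G), expression of G genes at T time points
    tf_idx : list of column indices that are transcription factors
    Returns an array of shape (G, n_tf + 1); the last column is the intercept.
    """
    X = expr[:-1][:, tf_idx]                        # TF expression at time t
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # add intercept column
    Y = expr[1:]                                    # every gene at time t + 1
    A = X1.T @ X1 + ridge * np.eye(X1.shape[1])     # ridge-regularized normal equations
    W = np.linalg.solve(A, X1.T @ Y)                # shape (n_tf + 1, G)
    return W.T

def predict_next(weights, tf_now):
    """Predict all genes at the next time point from current TF expression."""
    return weights[:, :-1] @ tf_now + weights[:, -1]

def rank_regulators(weights, tf_idx):
    """For each gene, list TFs from most to least influential.

    Influence here is the absolute coefficient, which assumes the TF
    expression values are on comparable scales.
    """
    importance = np.abs(weights[:, :-1])
    return [[tf_idx[j] for j in np.argsort(-row)] for row in importance]
```

A model would be fitted on all but the last time point, and `predict_next` applied to the transcription factor values at the final training time point to score the held-out one. The top-ranked entries from `rank_regulators` play the role of candidate causal edges, which could then be compared against experimentally known relationships.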

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Computational Biology / methods*
Computer Simulation
Gene Expression Profiling / methods*
Gene Expression Profiling / statistics & numerical data
Gene Regulatory Networks*
Machine Learning
Models, Genetic*
Reproducibility of Results
Transcription Factors / genetics*

Substances

Transcription Factors
