Integration of multiple types of genetic markers for neuroblastoma may contribute to improved prediction of the overall survival

Biol Direct. 2018 Sep 20;13(1):17. doi: 10.1186/s13062-018-0222-9.

Abstract

Background: Modern experimental techniques deliver data sets containing profiles of tens of thousands of potential molecular and genetic markers that can be used to improve medical diagnostics. Previous studies performed with three different experimental methods for the same set of neuroblastoma patients create opportunity to examine whether augmenting gene expression profiles with information on copy number variation can lead to improved predictions of patients survival. We propose methodology based on comprehensive cross-validation protocol, that includes feature selection within cross-validation loop and classification using machine learning. We also test dependence of results on the feature selection process using four different feature selection methods.

Results: The models utilising features selected based on information entropy are slightly, but significantly, better than those using features obtained with t-test. The synergy between data on genetic variation and gene expression is possible, but not confirmed. A slight, but statistically significant, increase of the predictive power of machine learning models has been observed for models built on combined data sets. It was found while using both out of bag estimate and in cross-validation performed on a single set of variables. However, the improvement was smaller and non-significant when models were built within full cross-validation procedure that included feature selection within cross-validation loop. Good correlation between performance of the models in the internal and external cross-validation was observed, confirming the robustness of the proposed protocol and results.

Conclusions: We have developed a protocol for building predictive machine learning models. The protocol can provide robust estimates of the model performance on unseen data. It is particularly well-suited for small data sets. We have applied this protocol to develop prognostic models for neuroblastoma, using data on copy number variation and gene expression. We have shown that combining these two sources of information may increase the quality of the models. Nevertheless, the increase is small and larger samples are required to reduce noise and bias arising due to overfitting.

Reviewers: This article was reviewed by Lan Hu, Tim Beissbarth and Dimitar Vassilev.

Keywords: Copy number variation; Feature selection; Gene expression; Machine learning; Neuroblastoma; Random forest; Synergy.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • DNA Copy Number Variations / genetics
  • Genetic Markers / genetics*
  • Humans
  • Machine Learning
  • Neuroblastoma / genetics*
  • Neuroblastoma / pathology*

Substances

  • Genetic Markers