A heuristic method for simulating open-data of arbitrary complexity that can be used to compare and evaluate machine learning methods

Pac Symp Biocomput. 2018:23:259-267.


A central challenge in developing and evaluating artificial intelligence and machine learning methods for regression and classification is access to data that illuminates the strengths and weaknesses of different methods. Open data plays an important role in this process by making it easy for computational researchers to access real data for this purpose. Genomics has in some cases taken a leading role in the open data effort, starting with DNA microarrays. While real data from experimental and observational studies is necessary for developing computational methods, it is not sufficient, because the ground truth in real data cannot be known. Real data must therefore be accompanied by simulated data in which the balance between signal and noise is known and can be directly evaluated. Unfortunately, there is a lack of methods and software for simulating data with the kind of complexity found in real biological and biomedical systems. We present here the Heuristic Identification of Biological Architectures for simulating Complex Hierarchical Interactions (HIBACHI) method and prototype software for simulating complex biological and biomedical data. Further, we introduce new methods for developing simulation models that generate data that specifically allows discrimination between different machine learning methods.
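As an illustration of why simulated data with a known ground truth helps discriminate between methods, the following is a minimal sketch and not the HIBACHI method or its software. It assumes a hypothetical generative model in which the phenotype is the XOR of the parities of two SNP genotypes, with a known label-flip rate `noise`. It then compares two deliberately simple stand-in learners, a single-SNP and a two-SNP majority-vote lookup table. All function names, parameters, and the XOR model itself are assumptions made for this example.

```python
import random


def simulate(n_samples, n_snps=5, noise=0.1, seed=0):
    """Simulate genotypes (0/1/2) and a binary phenotype with known ground truth.

    Assumed model (illustrative only): the phenotype is the XOR of the
    parities of SNP 0 and SNP 1, and each label is flipped with
    probability `noise`. Genotype frequencies 1:2:1 give no marginal effect.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_samples):
        g = [rng.choices((0, 1, 2), weights=(1, 2, 1))[0] for _ in range(n_snps)]
        y = (g[0] % 2) ^ (g[1] % 2)      # noiseless signal
        if rng.random() < noise:         # known amount of noise
            y = 1 - y
        rows.append(g)
        labels.append(y)
    return rows, labels


def fit_lookup(rows, labels, features):
    """Fit a majority-vote lookup table over the chosen genotype columns."""
    counts = {}
    for g, y in zip(rows, labels):
        key = tuple(g[i] for i in features)
        counts.setdefault(key, [0, 0])[y] += 1
    return {key: int(c[1] > c[0]) for key, c in counts.items()}


def accuracy(table, rows, labels, features):
    """Fraction of samples whose label the lookup table predicts correctly."""
    hits = sum(
        table.get(tuple(g[i] for i in features), 0) == y
        for g, y in zip(rows, labels)
    )
    return hits / len(labels)


# Train and test on independently simulated data sets.
train_x, train_y = simulate(2000, seed=1)
test_x, test_y = simulate(2000, seed=2)

single = fit_lookup(train_x, train_y, features=(0,))    # ignores the interaction
pair = fit_lookup(train_x, train_y, features=(0, 1))    # can capture the interaction

print("single-SNP accuracy:", accuracy(single, test_x, test_y, (0,)))
print("two-SNP accuracy:", accuracy(pair, test_x, test_y, (0, 1)))
```

Because the signal and the noise rate are set by the simulation, the gap between the two accuracies can be attributed to the interaction that only the second learner can represent, which is not possible with real data where the generating model is unknown.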

Publication types

  • Comparative Study
  • Evaluation Study

MeSH terms

  • Algorithms
  • Artificial Intelligence
  • Computational Biology / methods
  • Computer Simulation
  • Databases, Genetic / statistics & numerical data
  • Genetic Association Studies / statistics & numerical data
  • Heuristics
  • Humans
  • Machine Learning* / statistics & numerical data
  • Models, Genetic
  • Software
  • Stochastic Processes
  • Systems Biology