Systematic Investigation of Error Distribution in Machine Learning Algorithms Applied to the Quantum-Chemistry QM9 Data Set Using the Bias and Variance Decomposition

Luis Cesar de Azevedo; Gabriel A Pinheiro; Marcos G Quiles; Juarez L F Da Silva; Ronaldo C Prati

doi:10.1021/acs.jcim.1c00503

J Chem Inf Model. 2021 Sep 27;61(9):4210-4223. doi: 10.1021/acs.jcim.1c00503. Epub 2021 Aug 13.

Authors

Luis Cesar de Azevedo¹, Gabriel A Pinheiro², Marcos G Quiles², Juarez L F Da Silva³, Ronaldo C Prati¹

Affiliations

¹ Center of Mathematics, Computation and Cognition, Federal University of ABC, Av. dos Estados, 5001, 09210-580 Santo André, SP, Brazil.
² Institute of Science and Technology, Federal University of São Paulo (Unifesp), 12247-014 São José dos Campos, SP, Brazil.
³ São Carlos Institute of Chemistry, University of São Paulo, PO Box 780, 13560-970 São Carlos, SP, Brazil.

PMID: 34387994

Abstract

Most machine learning applications to quantum-chemistry (QC) data sets rely on a single statistical error measure, such as the mean squared error (MSE), to evaluate their performance. However, this approach has limitations and can even lead to incorrect interpretations. Here, we report a systematic investigation of the two components of the MSE, i.e., the bias and the variance, using the QM9 data set. To this end, we experiment with three descriptors, namely (i) symmetry functions (SF, with two-body and three-body functions), (ii) many-body tensor representation (MBTR, with two- and three-body terms), and (iii) smooth overlap of atomic positions (SOAP). With these descriptors, we evaluate prediction performance for different numbers of molecules in the training samples, as well as the effect of bias and variance on the final MSE. Overall, small sample sizes are associated with higher MSE. Moreover, the larger MSEs are strongly influenced by the bias component. Furthermore, there is little agreement across descriptors on which molecules have the highest errors (outliers). However, the molecules in the intersection set of the outliers are strongly associated with the convex hull volume of their geometric coordinates (VCH). Based on the distribution of the MSE (and its bias and variance components) and on the appearance of outliers, we suggest using ensembles of low-bias models to minimize the MSE, particularly when the training set contains a small number of molecules.
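
The decomposition the abstract refers to can be illustrated with a minimal sketch. For a fixed target value, the squared error averaged over models trained on different samples splits into a squared-bias term and a variance term. The code below is not the authors' procedure: the toy target function, the polynomial regressor, the noise level, and the names `bias_variance_decomposition` and `fit_predict` are all assumptions chosen for illustration, and no QM9 data or descriptors are involved.

```python
import numpy as np


def bias_variance_decomposition(predictions, y_true):
    """Split the per-sample MSE into squared bias and variance.

    predictions: array of shape (n_models, n_samples), one row per model
                 trained on a different training sample.
    y_true:      array of shape (n_samples,), the reference values.
    """
    predictions = np.asarray(predictions, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    mean_pred = predictions.mean(axis=0)            # average prediction per sample
    bias_sq = (mean_pred - y_true) ** 2             # squared bias per sample
    variance = predictions.var(axis=0)              # population variance (ddof=0)
    mse = ((predictions - y_true) ** 2).mean(axis=0)  # error averaged over models
    return mse, bias_sq, variance


# Toy setup (assumption): a 1D target standing in for a molecular property.
rng = np.random.default_rng(0)
x_test = np.linspace(-1.0, 1.0, 50)
y_test = np.sin(3.0 * x_test)


def fit_predict(degree, n_train):
    """Fit a polynomial to one random noisy training sample, predict on x_test."""
    x = rng.uniform(-1.0, 1.0, n_train)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.1, n_train)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)


# 200 models, each trained on a different small sample; degree 1 is a
# deliberately rigid (high-bias) model.
preds = np.stack([fit_predict(degree=1, n_train=30) for _ in range(200)])
mse, bias_sq, variance = bias_variance_decomposition(preds, y_test)

print("mean MSE:      ", mse.mean())
print("mean bias^2:   ", bias_sq.mean())
print("mean variance: ", variance.mean())
```

In this toy case the rigid linear model makes the squared-bias term dominate the MSE. Reporting the two terms separately, rather than only their sum, is what shows whether a large error comes from a model that is systematically off or from one that is unstable across training samples.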

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Bias
Machine Learning*