Interpretable Machine Learning Models for Phase Prediction in Polymerization-Induced Self-Assembly

J Chem Inf Model. 2023 Jun 12;63(11):3288-3306. doi: 10.1021/acs.jcim.3c00460. Epub 2023 May 19.

Abstract

While polymerization-induced self-assembly (PISA) has become a preferred synthetic route toward amphiphilic block copolymer self-assemblies, predicting their phase behavior from experimental design is extremely challenging, requiring the time- and labor-intensive creation of empirical phase diagrams whenever self-assemblies of novel monomer pairs are sought for specific applications. To alleviate this burden, we develop here the first framework for a data-driven methodology for the probabilistic modeling of PISA morphologies, based on a selection and suitable adaptation of statistical machine learning methods. As the complexity of PISA precludes generating large volumes of training data with in silico simulations, we focus on interpretable, low-variance methods that can be interrogated for conformity with chemical intuition and that promise to work well with only the 592 training data points that we curated from the PISA literature. We found that among the evaluated linear models, generalized additive models, and rule and tree ensembles, all but the linear models show decent interpolation performance, with around 0.2 estimated error rate and 1 bit expected cross-entropy loss (surprisal), when predicting the mixture of morphologies formed from monomer pairs already encountered in the training data. When considering extrapolation to new monomer combinations, the model performance is weaker, but the best model (random forest) still achieves highly nontrivial prediction performance (0.27 error rate, 1.6 bits surprisal), which renders it a good candidate to support the creation of empirical phase diagrams for new monomers and conditions. Indeed, we find in three case studies that, when used to actively learn phase diagrams, the model is able to select a smart set of experiments that lead to satisfactory phase diagrams after observing only relatively few data points (5-16) for the targeted conditions.
The data set as well as all model training and evaluation code is publicly available through the GitHub repository of the last author.
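The two reported metrics, error rate and expected cross-entropy loss in bits, can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes that each sample carries one binary indicator per morphology, that the model outputs an independent probability for each indicator, and that a sample counts as an error when the thresholded prediction differs from the observed mixture in any morphology; the paper's exact definitions may differ. The function names `surprisal_bits` and `evaluate` are hypothetical.

```python
import math


def surprisal_bits(probs, observed):
    """Surprisal of one sample in bits.

    Assumption: morphologies are scored as independent Bernoulli outcomes,
    so the total is the sum of -log2 p(observed indicator) over morphologies.
    """
    total = 0.0
    for p, y in zip(probs, observed):
        p_obs = p if y == 1 else 1.0 - p
        # Clip to avoid log(0) when the model is fully confident and wrong.
        total += -math.log2(max(p_obs, 1e-12))
    return total


def evaluate(pred_probs, observations, threshold=0.5):
    """Return (error rate, mean surprisal in bits) over all samples.

    Assumption: a sample is an error if the thresholded prediction does not
    match the observed set of morphologies exactly.
    """
    n = len(observations)
    errors = 0
    bits = 0.0
    for probs, obs in zip(pred_probs, observations):
        predicted = [1 if p >= threshold else 0 for p in probs]
        errors += predicted != list(obs)
        bits += surprisal_bits(probs, obs)
    return errors / n, bits / n


# Toy example with two morphologies and made-up probabilities.
error_rate, mean_bits = evaluate([[0.5, 0.5]], [[1, 0]])
```

Under these assumptions, a model that assigns probability 0.5 to each of two morphologies incurs 1 bit of surprisal per morphology, which gives a sense of scale for the 1 bit and 1.6 bit figures quoted above.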

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Linear Models
  • Machine Learning*
  • Polymerization
  • Polymers / chemistry

Substances

  • Polymers