Study becomes insight: Ecological learning from machine learning

Methods Ecol Evol. 2021 Nov;12(11):2117-2128. doi: 10.1111/2041-210X.13686. Epub 2021 Aug 6.

Abstract

The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental 'drivers' is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the 'learning' hidden in the models.

We revisit the theoretical background and effectiveness of four approaches for ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI) and two approaches for inferring bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualizing and interpreting complex multivariate relationships between response variables and environmental drivers, and we examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) the interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables.

We base the analysis on simulations of global species richness with known underlying functional relationships between response and predictor variables, with added white noise and correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by the interpretation algorithm and by spurious variables, and moderately affected by sample size. Removing spurious variables improves interpretation of ML models; increasing sample size has limited value while spurious variables remain, but does improve performance once they are omitted. Among the four ranking methods, SI is slightly more effective than the others in the presence of spurious variables, while GI and SI yield higher accuracy once spurious variables are removed. PDP is more effective than ALE in retrieving underlying functional relationships, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors on the response variable can be enhanced using surrogate models, including three-dimensional visualizations and loess planes to represent independent variable effects and interactions.

Machine learning analysts should be aware that including correlated independent variables with no clear causal relationship to the response can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to 'learn from machine learning'.
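For concreteness, the sketch below mirrors the study's general setup in scikit-learn: a response simulated from a known functional relationship with white noise, plus a correlated but non-influential ('spurious') predictor, followed by the two importance measures scikit-learn exposes directly (GI via feature_importances_, PI via permutation_importance). All variable names, functional forms and settings here are illustrative assumptions, not the paper's actual simulation; SI and CPI are omitted because they require other tooling (split counts from a gradient-boosting library, and R packages such as party/permimp, respectively).

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(42)
    n = 1000                                   # sample size (a factor varied in the study)
    x1 = rng.uniform(0, 1, n)                  # causal driver
    x2 = rng.uniform(0, 1, n)                  # causal driver
    x_spur = x1 + rng.normal(0, 0.1, n)        # correlated with x1 but non-influential

    # Known functional relationship plus white noise (illustrative, not the paper's).
    y = np.sin(2 * np.pi * x1) + 2 * x2 + rng.normal(0, 0.2, n)

    X = np.column_stack([x1, x2, x_spur])
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

    # Gini importance (GI): impurity decrease accumulated over the splits of the fitted trees.
    print("GI:", rf.feature_importances_)

    # Permutation importance (PI): drop in score after shuffling one column at a time.
    pi = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
    print("PI:", pi.importances_mean)

In a run like this, the spurious column typically draws away importance from x1 under GI and PI, which is the interference effect the abstract describes.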
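A companion sketch for the bivariate and multivariate interpretation tools: a partial dependence curve for one driver, and a three-dimensional prediction surface over two drivers in the spirit of the paper's surrogate-model visualizations. Again, the data and settings are illustrative assumptions; ALE is not built into scikit-learn and is omitted here, and the surface shows raw random forest predictions rather than the loess planes used in the paper.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import PartialDependenceDisplay

    rng = np.random.default_rng(1)
    n = 1000
    x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x1) + 2 * x2 + rng.normal(0, 0.2, n)
    X = np.column_stack([x1, x2])
    rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

    # PDP for x1: with no spurious variables present, the curve should
    # recover the sin(2*pi*x1) shape used in the simulation.
    PartialDependenceDisplay.from_estimator(rf, X, features=[0])

    # Joint prediction surface over (x1, x2), drawn in 3-D to show
    # main effects and interactions of the two drivers together.
    g = np.linspace(0, 1, 30)
    G1, G2 = np.meshgrid(g, g)
    Z = rf.predict(np.column_stack([G1.ravel(), G2.ravel()])).reshape(G1.shape)
    ax = plt.figure().add_subplot(projection="3d")
    ax.plot_surface(G1, G2, Z, cmap="viridis")
    ax.set_xlabel("x1"); ax.set_ylabel("x2"); ax.set_zlabel("predicted response")
    plt.show()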

Keywords: bivariate functional relationship; boosted regression tree (BRT); ecological inference; interpretation of machine learning models; random forest (RF); variable importance.