Simultaneous feature engineering and interpretation: Forecasting harmful algal blooms using a deep learning approach

Water Res. 2022 May 15:215:118289. doi: 10.1016/j.watres.2022.118289. Epub 2022 Mar 12.

Abstract

Routine monitoring for harmful algal blooms (HABs) is generally undertaken at a low temporal frequency (e.g., weekly to monthly) that is unsuitable for capturing highly dynamic variations in cyanobacteria abundance. Therefore, we developed a model incorporating reverse time attention with a decay mechanism (RETAIN-D) to forecast HABs with simultaneous improvements in temporal resolution, forecasting performance, and interpretability. The usefulness of RETAIN-D in forecasting HABs was illustrated by its application to two sites located in the lower sections of the Nakdong and Yeongsan rivers, South Korea, where HABs pose a critical water quality issue. Three variations of recurrent neural network models, i.e., long short-term memory (LSTM), gated recurrent unit (GRU), and reverse time attention (RETAIN), were adopted for comparisons of performance with RETAIN-D. Input features encompassing meteorological, hydrological, environmental, and biological factors were used to forecast cyanobacteria abundance (total cyanobacteria cell counts and cell counts of dominant cyanobacteria taxa). Incorporation of a decay mechanism into the deep learning structure in RETAIN-D allowed forecasts of HABs at a high temporal resolution (daily) without manual feature engineering, increasing the usefulness of resulting forecasts for water quality and resources management. RETAIN-D yielded a high degree of accuracy (RMSE = 0.29-1.67, R² = 0.76-0.98, MAE = 0.18-1.14, SMAPE = 9.77-87.94% for test sets; on natural log scales) across model outputs and sites, successfully capturing high variability and irregularities in the time series. RETAIN-D showed higher accuracy than RETAIN (except for comparable accuracy in forecasting Microcystis abundance at the Nakdong River site) and outperformed LSTM and GRU across all model outputs and sites.
Ambient temperature had high importance in forecasting cyanobacteria abundance across all model outputs and sites, whereas the relative importance of other input features varied by the output and site. Increases in contributions with increasing irradiance, decreasing flow rates, and increasing residence time were more pronounced in summer than in other seasons. Differences in the contributions of input features among different time steps (1 to 7 days prior to forecasting) were larger at the Yeongsan River site. RETAIN-D is applicable to a wide range of forecasting models that can benefit from improved temporal resolution, performance, and interpretability.
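
As an illustration of how accuracy figures such as the RMSE, R2, MAE, and SMAPE values above might be computed on a natural log scale, the following is a minimal sketch and not the authors' evaluation code. Several details are assumptions, since the abstract does not state them: the ln(x + 1) transform (chosen so that zero cell counts remain defined), the SMAPE variant with the mean of absolute values in the denominator, R2 taken as the coefficient of determination, the function name `log_scale_metrics`, and the example cell counts, which are invented for demonstration only.

```python
import math


def log_scale_metrics(observed_cells, forecast_cells):
    """Return RMSE, R2, MAE, and SMAPE (%) computed on a natural-log scale."""
    if len(observed_cells) != len(forecast_cells) or len(observed_cells) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Assumption: ln(x + 1) transform, so zero cell counts map to 0.
    y = [math.log1p(v) for v in observed_cells]
    y_hat = [math.log1p(v) for v in forecast_cells]
    n = len(y)

    errors = [a - b for a, b in zip(y, y_hat)]
    ss_res = sum(e * e for e in errors)

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # Assumption: R2 as the coefficient of determination, 1 - SS_res / SS_tot.
    mean_y = sum(y) / n
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # Assumption: SMAPE with (|y| + |y_hat|) / 2 in the denominator.
    # Pairs where both values are zero contribute zero error.
    smape_terms = [
        2.0 * abs(a - b) / (abs(a) + abs(b))
        for a, b in zip(y, y_hat)
        if abs(a) + abs(b) > 0
    ]
    smape = 100.0 * sum(smape_terms) / n

    return {"rmse": rmse, "r2": r2, "mae": mae, "smape": smape}


# Hypothetical daily cell counts (cells/mL), invented for illustration only.
observed = [0, 1200, 35000, 8000]
forecast = [0, 1500, 30000, 9000]

for name, value in log_scale_metrics(observed, forecast).items():
    print(f"{name}: {value:.3f}")
```

Because SMAPE divides by the magnitude of the log-transformed values, it can become large when abundances are near zero, which may be one reason a wide SMAPE range can coexist with a high R2.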

Keywords: Cyanobacteria; Decay mechanism; Explainable artificial intelligence; Harmful algal bloom; Recurrent neural network; Reverse time attention mechanism.

MeSH terms

  • Cyanobacteria*
  • Deep Learning*
  • Harmful Algal Bloom
  • Rivers
  • Water Quality