Machine learning-based prediction of effluent total suspended solids in a wastewater treatment plant using different feature selection approaches: A comparative study

Environ Res. 2024 Apr 1:246:118146. doi: 10.1016/j.envres.2024.118146. Epub 2024 Jan 11.

Abstract

Accurately predicting the characteristics of effluent discharged from wastewater treatment plants (WWTPs) is crucial for reducing sampling requirements, labor, costs, and environmental pollution. Machine learning (ML) techniques can be effective in achieving this goal, and various feature selection (FS) methods are employed to optimize ML-based models. This study investigates the impact of six FS methods (categorized as Wrapper, Filter, and Embedded methods) on the accuracy of three supervised ML algorithms in predicting total suspended solids (TSS) concentration in the effluent of a municipal wastewater treatment plant. Based on the features proposed by each FS method, five distinct scenarios were defined. Within each scenario, three ML algorithms were applied: artificial neural network-multilayer perceptron (ANN-MLP), K-nearest neighbors (KNN), and adaptive boosting (AdaBoost). The candidate features for predicting TSS concentration in the WWTP effluent were BOD5, COD, TSS, TN, and NH3 in the influent, and BOD5, COD, residual Cl2, NO3, TN, and NH4 in the effluent. To construct the models, the dataset was randomly divided into training and testing subsets, and K-fold cross-validation was employed to control overfitting and underfitting. Model performance was evaluated using root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R2). The most efficient scenario was Scenario IV, which used the Sequential Backward Selection FS method. The features selected by this method were CODe, BOD5e, BOD5i, and TNi (suffixes e and i denoting effluent and influent, respectively). The ANN-MLP algorithm demonstrated the best performance, achieving the highest R2 value, and performed acceptably on both the training and testing subsets (R2 = 0.78 and R2 = 0.8, respectively).
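
The workflow described in the abstract (sequential backward selection, three regressors, K-fold cross-validation, and RMSE/MAE/R2 scoring) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The data are synthetic random numbers, and the estimator used inside the selector, the number of folds, the 80/20 split, and all hyperparameters are assumptions chosen only to make the example run. Only the feature names and the choice of four retained features follow the abstract.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Candidate features named in the abstract (i = influent, e = effluent).
FEATURES = ["BOD5i", "CODi", "TSSi", "TNi", "NH3i",
            "BOD5e", "CODe", "Cl2e", "NO3e", "TNe", "NH4e"]

# Synthetic stand-in data (assumption): the real plant measurements are not
# part of this record, so the target is an arbitrary linear mix plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(FEATURES)))
y = (2.0 * X[:, 6] + 1.5 * X[:, 5] + 1.0 * X[:, 0] + 0.5 * X[:, 3]
     + rng.normal(scale=0.3, size=300))

# Random train/test split (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# K-fold cross-validation (5 folds is an assumption).
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Sequential backward selection: start from all features and drop one at a
# time until four remain. The wrapped KNN estimator is an assumption.
selector = SequentialFeatureSelector(
    make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    n_features_to_select=4, direction="backward", scoring="r2", cv=cv)
selector.fit(X_train, y_train)
selected = [name for name, keep in zip(FEATURES, selector.get_support()) if keep]
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# The three algorithm families compared in the study (hyperparameters assumed).
models = {
    "ANN-MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean cross-validated R2 on the training subset.
    cv_r2 = cross_val_score(model, X_train_sel, y_train, cv=cv, scoring="r2").mean()
    # Refit on the full training subset and score on the held-out test subset.
    model.fit(X_train_sel, y_train)
    pred = model.predict(X_test_sel)
    results[name] = {
        "cv_r2": float(cv_r2),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

print("Selected features:", selected)
for name, scores in results.items():
    print(name, scores)
```

Because the data here are synthetic, the selected features and scores say nothing about the plant studied; the sketch only shows how the pieces fit together. Fitting the selector on the training subset alone keeps the test subset out of the feature selection step.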

Keywords: ANN; AdaBoost; Feature selection; KNN; Machine learning; Total suspended solids.

MeSH terms

  • Algorithms
  • Machine Learning
  • Neural Networks, Computer
  • Waste Disposal, Fluid* / methods
  • Water Purification* / methods