How do data-mining models consider arsenic contamination in sediments and variables importance?

Environ Monit Assess. 2019 Nov 28;191(12):777. doi: 10.1007/s10661-019-7979-x.


Arsenic (As) is one of the most important hazardous elements, as more than 100 million people are exposed to risk globally. The permissible threshold of As in drinking water is 10 μg/L according to both the WHO's drinking water guidelines and the Iranian national standard. However, several studies have indicated that As concentrations exceed this threshold in several regions of Iran. This research evaluates an As-susceptible region, the Tajan River watershed, using the following data-mining models: multivariate adaptive regression splines (MARS), functional data analysis (FDA), support vector machine (SVM), generalized linear model (GLM), multivariate discriminant analysis (MDA), and gradient boosting machine (GBM). This study considers 12 factors for elevated As concentrations: land use, drainage density, profile curvature, plan curvature, slope length, slope degree, topographic wetness index, erosion, village density, distance from villages, precipitation, and lithology. The susceptibility mapping was conducted using training (70%) and validation (30%) datasets. The results for As contamination in sediment showed that the classifications into 4 concentration levels are very similar for the GLM and FDA models. The GBM calculated the areas of highest arsenic contamination risk by MARS and SVM with percentages of 30.0% and 28.7%, respectively. FDA, GLM, MARS, and MDA models calculated the areas of lowest risk to be 3.3%, 23.0%, 72.0%, 25.2%, and 26.1%, respectively. The ROC curve results reveal that MARS, SVM, and MDA had the highest accuracies, with area under the ROC curve (AUC) values of 84.6%, 78.9%, and 79.5%, respectively. Land use, lithology, erosion, and elevation were the most important predictors of contamination potential, with values of 0.6, 0.59, 0.57, and 0.56, respectively. Finally, these data-mining methods can be used as appropriate, inexpensive, and feasible options to identify As-susceptible areas and can guide managers in reducing contamination in the sediment of the environment and in the food chain.
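
The validation procedure described in the abstract (a 70%/30% split followed by ROC-based accuracy assessment) can be illustrated with a minimal sketch. This is not the authors' code: the function names `train_validation_split` and `roc_auc`, the fixed random seed, and any sample data passed to them are hypothetical, and the AUC is computed here through its pairwise-ranking interpretation rather than with whatever software the study used.

```python
import random


def train_validation_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and validation subsets.

    train_fraction=0.7 mirrors the 70%/30% split stated in the abstract;
    the seed is an arbitrary choice made only for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = contaminated, 0 = not).

    Uses the rank interpretation of AUC: the probability that a randomly
    chosen positive sample receives a higher score than a randomly chosen
    negative sample, with ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

For example, with the made-up labels `[1, 1, 0, 0, 1, 0]` and model scores `[0.9, 0.8, 0.7, 0.2, 0.6, 0.4]`, `roc_auc` returns 8/9 (about 0.889), because eight of the nine positive-negative pairs are ranked correctly. In practice the scores would come from a model fitted on the training subset and evaluated on the validation subset.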

Keywords: Arsenic; Data-mining; GIS-based mapping; Human health; Iran; LVQ.

MeSH terms

  • Arsenic* / analysis
  • Data Mining*
  • Drinking Water / analysis
  • Drinking Water / standards
  • Environmental Monitoring* / methods
  • Environmental Pollutants* / analysis
  • Geologic Sediments* / chemistry
  • Iran
  • Models, Theoretical*
  • ROC Curve

Substances

  • Drinking Water
  • Environmental Pollutants
  • Arsenic