The presence of lead in drinking water creates a public health crisis, as lead causes neurological damage even at low levels of exposure. The objective of this research is to explore modeling approaches to predict the risk of lead contamination in private drinking water systems. This research uses Bayesian Network approaches to explore interactions among household characteristics, geological parameters, observations of tap water, and laboratory tests of water quality parameters. A knowledge discovery framework is developed by integrating methods for data discretization, feature selection, and Bayes classifiers. Forward selection and backward selection are explored for feature selection. Discretization approaches, including domain-knowledge, statistical, and information-based approaches, are tested to discretize continuous features. The Bayes classifiers tested are General Bayesian Network, Naive Bayes, and Tree-Augmented Naive Bayes, which are applied to identify Directed Acyclic Graphs (DAGs). Bayesian inference is used to fit conditional probability tables for each DAG. The Bayesian framework is applied to a dataset from the Virginia Household Water Quality Program (VAHWQP), which collected water samples and conducted household surveys at 2,146 households that use private water systems, including wells and springs, in Virginia during 2012 and 2013. Relationships among laboratory-tested water quality parameters, observations of tap water, and household characteristics, including plumbing type, source water, household location, and on-site water treatment, are explored to develop features for predicting water lead levels. Results demonstrate that Naive Bayes classifiers outperform the other classifiers on recall and precision. Copper is the most significant predictor of lead, and other important predictors include county, pH, and on-site water treatment. Feature selection methods have only a marginal effect on performance, whereas the choice of discretization method, in combination with the classifier, can greatly affect model performance. Owners of private wells remain disadvantaged and may be at elevated risk, because utilities and governing agencies are not responsible for ensuring that lead levels in private wells meet the Lead and Copper Rule. Insights gained from the models can be used to identify water quality parameters, plumbing characteristics, and household variables that increase the likelihood of high water lead levels, and thereby to inform decisions about lead testing and treatment.
Keywords: Bayesian Belief Network; Contamination Classification; Lead in Drinking Water; Water Quality.
Copyright © 2020 Elsevier Ltd. All rights reserved.
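Illustrative sketch (not the authors' implementation): the pipeline summarized in the abstract, discretizing continuous features, fitting conditional probability tables for a Naive Bayes classifier, and scoring by precision and recall, might look roughly as follows. Everything here is an assumption made for illustration: the copper, pH, and lead values are hypothetical toy data and not VAHWQP records, equal-frequency binning stands in for one possible statistical discretization approach, and Laplace smoothing (the posterior mean under a uniform Dirichlet prior) stands in for the Bayesian fitting of the tables.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Cut points that split the sorted values into n_bins roughly equal groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]

def discretize(value, cuts):
    """Index of the bin a continuous value falls in (0 = lowest bin)."""
    return sum(value >= c for c in cuts)

class NaiveBayes:
    """Categorical Naive Bayes with Laplace-smoothed conditional probability tables."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # pseudo-count; an assumed choice, not taken from the paper

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        n_features = len(rows[0])
        # Distinct levels observed for each feature, used in the smoothing denominator.
        self.levels = [sorted({r[j] for r in rows}) for j in range(n_features)]
        self.counts = Counter()
        for row, y in zip(rows, labels):
            for j, v in enumerate(row):
                self.counts[(j, v, y)] += 1
        return self

    def cpt(self, j, v, y):
        """Smoothed estimate of P(feature j = v | class y)."""
        numerator = self.counts[(j, v, y)] + self.alpha
        denominator = self.class_counts[y] + self.alpha * len(self.levels[j])
        return numerator / denominator

    def predict(self, row):
        total = sum(self.class_counts.values())

        def log_score(y):
            prior = math.log(self.class_counts[y] / total)
            return prior + sum(math.log(self.cpt(j, v, y)) for j, v in enumerate(row))

        return max(self.classes, key=log_score)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: copper (mg/L), pH, and a lead class per household.
copper = [0.02, 0.05, 0.10, 0.90, 1.40, 1.10, 0.03, 1.30]
ph = [7.4, 7.1, 7.6, 5.9, 6.1, 6.0, 7.2, 5.8]
lead = ["low", "low", "low", "high", "high", "high", "low", "high"]

copper_cuts = equal_frequency_bins(copper, 2)
ph_cuts = equal_frequency_bins(ph, 2)
rows = [(discretize(c, copper_cuts), discretize(p, ph_cuts)) for c, p in zip(copper, ph)]

model = NaiveBayes(alpha=1.0).fit(rows, lead)
predictions = [model.predict(r) for r in rows]
precision, recall = precision_recall(lead, predictions, positive="high")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The sketch scores the model on its own training rows purely to keep it short; a real evaluation would hold out data. Swapping `equal_frequency_bins` for a different rule changes the bins the classifier sees, which is one way to observe how discretization and classifier choice interact.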