Review
May-Jun 2018;8(3):e1352. doi: 10.1002/wcms.1352. Epub 2017 Dec 4.

In silico Toxicology: Comprehensive Benchmarking of Multi-Label Classification Methods Applied to Chemical Toxicity Data


Arwa B Raies et al. Wiley Interdiscip Rev Comput Mol Sci. 2018.
Free PMC article

Abstract

One goal of toxicity testing, among others, is identifying harmful effects of chemicals. Given the high demand for toxicity tests, it is necessary to conduct these tests for multiple toxicity endpoints for the same compound. Current computational toxicology methods aim at developing models mainly to predict a single toxicity endpoint. When chemicals cause several toxicity effects, one model is generated to predict toxicity for each endpoint, which can be labor-intensive and computationally expensive when the number of toxicity endpoints is large. Additionally, this approach does not take into consideration possible correlations between the endpoints. Therefore, there has been a recent shift in computational toxicity studies toward generating predictive models able to predict several toxicity endpoints by utilizing correlations between these endpoints. Applying such correlations jointly with compounds' features may improve models' performance and reduce the number of required models. This can be achieved through multi-label classification methods. These methods have not undergone comprehensive benchmarking in the domain of predictive toxicology. Therefore, we performed extensive benchmarking and analysis of over 19,000 multi-label classification models generated using combinations of the state-of-the-art methods. The methods were evaluated from different perspectives using various metrics to assess their effectiveness. We were able to illustrate variability in the performance of the methods under several conditions. This review will help researchers select the most suitable method for the problem at hand and provides a baseline for evaluating new approaches. Based on this analysis, we provide recommendations for potential future directions in this area. This article is categorized under: Computer and Information Science > Chemoinformatics; Computer and Information Science > Computer Algorithms and Programming.
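For readers less familiar with the distinction, here is a minimal sketch, assuming scikit-learn and random placeholder data (not the paper's pipeline or data set), of binary relevance (one independent model per endpoint) versus a correlation-aware multi-label method:

```python
# Minimal sketch: one-model-per-endpoint vs. a multi-label model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # 200 hypothetical compounds, 30 descriptors
Y = (rng.random((200, 4)) > 0.5).astype(int)   # 4 hypothetical toxicity endpoints

# Binary relevance: an independent classifier per endpoint (ignores label correlations).
br = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Classifier chains: each endpoint's model also sees the preceding endpoints' labels.
cc = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0).fit(X, Y)

print(br.predict(X[:3]))
print(cc.predict(X[:3]))
```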

Figures

Figure 1
Illustrations of single-label classification and multi-label classification. X is the data set in which feature vectors describe compounds C1, …, Cn, where n is the number of compounds; F1, …, Fm are the features, where m is the number of features. Y is the label vector (in single-label classification) or the label matrix (in multi-label classification). (a) Binary classification. (b) Multi-class classification. (c) Multi-label classification. Missing labels are denoted with '?'; '1' and '0' are known labels.
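A short sketch of the three label layouts from Figure 1, with NaN standing in for the '?' entries (toy values only):

```python
# Sketch of the three label layouts in Figure 1 (illustrative values).
import numpy as np

y_binary = np.array([0, 1, 1, 0])          # one label, two classes
y_multiclass = np.array([2, 0, 1, 2])      # one label, more than two classes
Y_multilabel = np.array([[1, 0, np.nan],   # one row per compound,
                         [0, 1, 1],        # one column per endpoint;
                         [np.nan, 1, 0],   # NaN plays the role of '?'
                         [1, np.nan, 1]])  # (a missing/unknown label)
```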
Figure 2
Overview of modeling approaches. (a) Three categories of the computational methods: feature selection, multi-label classification, and base classifiers. MLDT, multi-label decision tree; MLKNN, multi-label K nearest neighbors; MLC-BMaD, multi-label Boolean matrix decomposition. (b) A list of base classifiers along with their corresponding kernels, solvers, splitting criteria, and distance metrics (when applicable). CD, Coordinate Descent; CG, Conjugate Gradient; LBFGS, Limited-memory quasi-Newton; SAG, Stochastic Average Gradient; RBF, Radial Basis Function. (c) Three feature selection methods. L1, L2, and L3: labels; X: the original feature set; X1, X2, X3: selected feature sets for labels L1, L2, and L3, respectively; xi: a single feature; Xs: the combined feature set; M1, M2, and M3: models for endpoints L1, L2, and L3, respectively; t: variance threshold.
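The three feature-selection flavours in Figure 2(c) can be sketched as follows; the selectors (VarianceThreshold, SelectKBest) and their parameters are illustrative choices, not necessarily the ones benchmarked in the paper:

```python
# Sketch of unsupervised, label-specific, and shared feature selection.
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))                 # placeholder descriptor matrix
Y = (rng.random((100, 3)) > 0.5).astype(int)   # labels L1..L3

# Unsupervised: drop low-variance features once, shared by all labels (t: variance threshold).
X_ufs = VarianceThreshold(threshold=0.1).fit_transform(X)

# Label-specific: select a feature subset X_i independently for each label L_i.
X_per_label = [SelectKBest(f_classif, k=10).fit_transform(X, Y[:, i])
               for i in range(Y.shape[1])]

# Supervised/shared: the union of the per-label selections forms the combined set Xs.
masks = [SelectKBest(f_classif, k=10).fit(X, Y[:, i]).get_support()
         for i in range(Y.shape[1])]
X_s = X[:, np.logical_or.reduce(masks)]
```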
Figure 3
Illustrations of some multi-label classification methods. (a) X is the matrix of features of compounds C1, …, Cn, where n is the number of compounds, and their features F1, …, Fm, where m is the number of features. (b) L is the label matrix, which consists of four labels in this example. Positive and negative labels are denoted by '1' and '0', respectively, while '?' indicates missing labels. (c) Classifier chains method. Matrix X′ consists of the feature matrix X from part (a) extended with the label L1 from matrix L in part (b). The missing labels of L1 are imputed. X′ is used to train a model M to predict a second label, L2. (d) Label powerset method. Matrix L′ consists of the transformed multi-class labels. Each unique label combination is a distinct class. For example, l1 indicates that L1 is positive, while ~l2 indicates that L2 is negative. Missing labels are not encoded. (e) Random K labelset method. Matrix L′ consists of two labelsets of length K = 2, and each labelset is represented using the label powerset method. In this example, the first labelset consists of labels L1 and L2, and the second labelset consists of labels L3 and L4. (f) Multi-label Boolean matrix decomposition method. L′ is the decomposed matrix that consists of three latent labels in this example: L′1, L′2, and L′3. (g) Matrix Y′ is the second matrix from the decomposition based on the multi-label Boolean matrix decomposition method.
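The label transformations in Figure 3 can be expressed compactly in NumPy; a real benchmark would use a multi-label library (e.g., scikit-multilearn), and this toy matrix has no missing labels:

```python
# Sketch of the classifier-chains, label-powerset, and RAkEL transformations.
import numpy as np

X = np.random.default_rng(2).normal(size=(5, 3))   # placeholder feature matrix
L = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1]])   # four fully observed labels (no '?' in this toy case)

# (c) Classifier chains: append L1 to the features before predicting L2.
X_prime = np.column_stack([X, L[:, 0]])

# (d) Label powerset: each unique row of L becomes one multi-class label.
combos, L_prime = np.unique(L, axis=0, return_inverse=True)

# (e) RAkEL: split the labels into labelsets of size K = 2, then powerset each.
labelsets = [L[:, :2], L[:, 2:]]
```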
Figure 4
Data set description. (a) Toxicity profiles of 6644 compounds for 17 toxicity endpoints. Each row corresponds to a compound, each column corresponds to a toxicity endpoint, and each cell represents a compound's activity per endpoint. Compounds are numbered from 0 to 6643. Red cells indicate active/toxic compounds, blue cells indicate inactive/nontoxic compounds, and gray cells denote unknown toxicity. (b) A bar graph of the number of toxic and nontoxic compounds associated with each toxicity endpoint. (c) A bar graph of the number of known toxicity effects per compound. (d) A bar graph of the percentage of positive and negative toxicity effects per compound.
Figure 5
Comparison of macro-average performance of multi-label and binary relevance models in (a) internal and (b) external validation. Bar graphs show the models' performance via five metrics: accuracy, F1-score, precision, recall, and specificity. Models are numbered from 0 to 19,185. The gray areas in the bar graphs show the performance range of binary relevance models. BR, binary relevance; CC, classifier chains; LP, label powerset; MLC-BMaD, multi-label Boolean matrix decomposition; MLDT, multi-label decision tree; MLKNN, multi-label K nearest neighbors; RAkEL, random K labelset.
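Macro-averaging, as used in Figure 5, computes each metric per endpoint and then averages over endpoints; a minimal sketch with placeholder predictions:

```python
# Sketch of macro-averaging a per-endpoint metric over all endpoints.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(3)
Y_true = (rng.random((50, 17)) > 0.5).astype(int)   # 17 endpoints, as in the data set
Y_pred = (rng.random((50, 17)) > 0.5).astype(int)   # placeholder predictions

# Macro-average: compute the metric per endpoint (column), then take the mean.
macro_acc = np.mean([accuracy_score(Y_true[:, j], Y_pred[:, j])
                     for j in range(Y_true.shape[1])])
macro_f1 = f1_score(Y_true, Y_pred, average='macro')   # sklearn does this directly
# (specificity is the recall of the negative class: recall_score(..., pos_label=0))
print(macro_acc, macro_f1)
```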
Figure 6
Comparison of macro-average performance of models in internal and external validations. The scatter plots show the models' performance via five metrics: accuracy, F1-score, precision, recall, and specificity. The x-axis and y-axis show model performance in internal and external validation, respectively. The closer the models are to the diagonal of the scatter plots (from the (0,0) point to the (1,1) point), the more similar their performance is in internal and external validations. Models with high variability between internal and external performance appear below or above the diagonal region and are marked in orange and blue, respectively.
Figure 7
Accuracy scores per endpoint of the top-ranked models generated by each multi-label classification method and the top-ranked binary relevance model in (a) internal and (b) external validation. Rows correspond to the multi-label classification methods and the binary relevance method; columns correspond to endpoints. Each cell shows the accuracy score of a method for an endpoint. The scores range from 0.0 (worst performance) to 1.0 (best performance). BR, binary relevance; CC, classifier chains; DL, deep learning; LP, label powerset; MLC-BMaD, multi-label Boolean matrix decomposition; MLDT, multi-label decision tree; MLKNN, multi-label K nearest neighbors; RAkEL, random K labelset; SSL, semi-supervised learning.
Figure 8
Area under the receiver operating characteristic curve (AUROC) scores of the top-ranked models generated by each multi-label classification method and the binary relevance method per endpoint in (a) internal and (b) external validation. Rows correspond to the multi-label classification methods and the binary relevance method; columns correspond to endpoints. Each cell shows the AUROC score of a method for an endpoint. The scores range from 0.0 (worst performance) to 1.0 (best performance); an AUROC score of 0.5 indicates random predictions. BR, binary relevance; CC, classifier chains; DL, deep learning; LP, label powerset; MLC-BMaD, multi-label Boolean matrix decomposition; MLDT, multi-label decision tree; MLKNN, multi-label K nearest neighbors; RAkEL, random K labelset; SSL, semi-supervised learning.
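AUROC is computed per endpoint from predicted scores rather than hard labels; a sketch with random placeholder probabilities:

```python
# Sketch of per-endpoint AUROC, one value per heat-map cell in Figure 8.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
Y_true = (rng.random((100, 17)) > 0.5).astype(int)   # 17 endpoints
Y_score = rng.random((100, 17))                      # predicted probabilities

per_endpoint = [roc_auc_score(Y_true[:, j], Y_score[:, j]) for j in range(17)]
print(per_endpoint)   # values near 0.5 mean the (random) scores carry no signal
```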
Figure 9
Performance of estimating the toxicity of a given endpoint using the average toxicity values of the other endpoints in (a) internal and (b) external validation. Each row corresponds to a performance metric, and each column corresponds to an endpoint. Each cell shows the calculated score per endpoint. The scores range from 0.0 (worst performance) to 1.0 (best performance).
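The baseline in Figure 9 estimates an endpoint from the compound's other known endpoint values; a plausible reading of that rule, sketched with a toy label matrix (the exact thresholding used in the paper is not shown here):

```python
# Sketch of the endpoint-correlation baseline: predict endpoint j for compound i
# as the thresholded mean of the compound's other known endpoint values.
import numpy as np

L = np.array([[1., 0., 1., np.nan],
              [0., 1., np.nan, 1.],
              [1., 1., 0., 0.]])   # toy label matrix; NaN marks unknown toxicity

def estimate(L, i, j):
    """Mean of compound i's labels over all endpoints except j, thresholded at 0.5."""
    others = np.delete(L[i], j)
    known = others[~np.isnan(others)]
    return int(known.mean() >= 0.5) if known.size else None

print(estimate(L, 0, 3))   # endpoint 4 of compound 1, estimated from endpoints 1-3
```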
Figure 10
Predictability of endpoints in (a) internal and (b) external validation. The heat maps show the models' performance in predicting each toxicity endpoint. Each row corresponds to a model, and each column corresponds to a toxicity endpoint. Cells represent a model's performance in predicting an endpoint. Models are numbered from 0 to 19,185. The performance is calculated using the mean absolute error metric and ranges from 0.0 (best performance) to 1.0 (worst performance). Based on the models' performance, the endpoints were grouped into two clusters: endpoints with high predictability (green clusters) and endpoints with low predictability (orange clusters).
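One way to reproduce this kind of grouping is to cluster the endpoints by the error profile they induce across models; the sketch below uses 2-cluster k-means as a stand-in for whatever clustering the heat map actually uses, with a random placeholder error matrix:

```python
# Sketch: group endpoints into high/low predictability by their per-model MAE profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
mae = rng.random((100, 17))   # rows: models, columns: endpoints (placeholder errors)

# Cluster endpoints (columns) by the error profile they induce across all models.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mae.T)
high_predictability = np.where(labels == labels[mae.mean(axis=0).argmin()])[0]
print(high_predictability)   # endpoints grouped with the lowest-error endpoint
```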
Figure 11
Predictability of compounds' toxicity in (a) internal and (b) external validations. The heat maps show the models' performance in predicting the toxicity of each compound. Each row corresponds to a model, and each column corresponds to a compound. Cells represent each model's performance in predicting the toxicity of each compound. Models are numbered from 0 to 19,185. The performance is calculated using the mean absolute error metric and ranges from 0.0 (best performance) to 1.0 (worst performance). Based on the models' performance, the compounds were grouped into three clusters: compounds with high predictability (green clusters), compounds with medium predictability (magenta clusters), and compounds with low predictability (orange clusters).
Figure 12
The relationship between compounds' predictability and the number of known toxicity effects per compound. The histograms show the probability distribution of the number of known toxicity endpoints per compound for compounds with high, medium, and low predictability in (a) internal and (b) external validations.
Figure 13
Effect of feature selection on models' performance in (a) internal and (b) external validation. Bar graphs show the models' macro-average performance via five metrics: accuracy, F1-score, precision, recall, and specificity. Models are numbered from 0 to 19,185. SFS, supervised feature selection; UFS, unsupervised feature selection; LSFS, label-specific feature selection; None, no feature selection method applied.

