J Chem Inf Model, 50 (7), 1189-204

Trust, but Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research

Denis Fourches et al. J Chem Inf Model.

Abstract

Molecular modelers and cheminformaticians typically analyze experimental data generated by other scientists. Consequently, when it comes to data accuracy, cheminformaticians are always at the mercy of data providers who may inadvertently publish (partially) erroneous data. Thus, dataset curation is crucial for any cheminformatics analysis such as similarity searching, clustering, QSAR modeling, or virtual screening, especially now that the availability of chemical datasets in the public domain has skyrocketed. Despite the obvious importance of this preliminary step in the computational analysis of any dataset, there appears to be no commonly accepted guidance or set of procedures for chemical data curation. The main objective of this paper is to emphasize the need for a standardized chemical data curation strategy that should be followed at the onset of any molecular modeling investigation. Herein, we discuss several simple but important steps for cleaning chemical records in a database, including the removal of a fraction of the data that cannot be appropriately handled by conventional cheminformatics techniques. Such steps include the removal of inorganic and organometallic compounds, counterions, salts, and mixtures; structure validation; ring aromatization; normalization of specific chemotypes; curation of tautomeric forms; and the deletion of duplicates. To emphasize the importance of data curation as a mandatory step in data analysis, we discuss several case studies where chemical curation of the original “raw” database enabled a successful modeling study (specifically, QSAR analysis) or resulted in a significant improvement of the model's prediction accuracy. We also demonstrate that in some cases rigorously developed QSAR models could even be used to correct erroneous biological data associated with chemical compounds.
We believe that good practices for curation of chemical records outlined in this paper will be of value to all scientists working in the fields of molecular modeling, cheminformatics, and QSAR studies.
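
To make a few of the curation steps listed above concrete, the following is a minimal Python sketch, not the authors' workflow or any of the tools named in the paper. It assumes records arrive as (SMILES, activity) pairs, and it uses a crude regular-expression atom scan instead of a real cheminformatics toolkit. The names `ORGANIC`, `elements`, `largest_fragment`, and `curate` are hypothetical, as is the choice of which elements count as "organic". It only illustrates three ideas: stripping counterions by keeping the largest fragment, rejecting inorganic and organometallic records, and flagging duplicates whose activities disagree. Duplicates are matched by exact string here; a real pipeline would canonicalize structures, neutralize charges, and handle tautomers first.

```python
import re
from collections import defaultdict

# Assumption: elements treated as "organic" for this sketch (not from the paper).
ORGANIC = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}

# Crude atom scan: bracket atoms like [Na+] or [nH], then organic-subset atoms.
_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def elements(smiles):
    """Return the element symbols found in a SMILES string (aromatic c -> C)."""
    return [(m.group(1) or m.group(0)).capitalize() for m in _ATOM.finditer(smiles)]


def largest_fragment(smiles):
    """Keep the dot-separated fragment with the most non-hydrogen atoms."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda f: sum(e != "H" for e in elements(f)))


def curate(records):
    """Split (smiles, activity) records into unique, conflicting, and rejected."""
    kept = defaultdict(list)
    rejected = []
    for smiles, activity in records:
        parent = largest_fragment(smiles)  # drops counterions and small co-formers
        found = set(elements(parent))
        if "C" not in found:
            rejected.append((smiles, "inorganic"))
        elif not found <= ORGANIC:
            rejected.append((smiles, "organometallic"))
        else:
            kept[parent].append(activity)

    unique, conflicts = {}, {}
    for parent, activities in kept.items():
        if len(set(activities)) == 1:
            unique[parent] = activities[0]   # consistent duplicates collapse to one
        else:
            conflicts[parent] = activities   # same structure, different activity
    return unique, conflicts, rejected
```

Calling `curate` on a list of pairs returns three collections: structures with a single consistent activity, structures whose duplicate records disagree (candidates for manual review), and rejected records with a reason. Keeping the conflicting duplicates visible rather than silently averaging or dropping them is a design choice of this sketch.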

Figures

Figure 1. General dataset curation workflow.
Figure 2. Descriptor calculation for three organometallic compounds using DRAGON, MOE, ISIDA and HiT QSAR software.
Figure 3. Structure normalization: five types of nitro group representations retrieved in the nitroaromatics dataset for rats and T. pyriformis case studies (see section 3.2 in the text for details).
Figure 4. Use of ChemAxon Standardizer to normalize three compounds possessing the sydnone chemotype (see text for details).
Figure 5. Examples of misleading structure representations produced by the “general style” option available in ChemAxon Standardizer, which may serve as a potential source of errors for programs calculating molecular descriptors.
Figure 6. Automatic retrieval of structural duplicates using the ISIDA/Duplicates program: example of stereoisomers (Ames mutagenicity dataset) with opposite mutagenicity properties.
Figure 7. Two structural isomers retrieved as either duplicates or non-duplicates by ISIDA/Duplicates and HiT QSAR according to different pools of chemical descriptors.
Figure 8. Real examples of erroneous structure records in chemical databases leading to Dragon error messages.
Figure 9. Experimental bioavailability values (%) from QSARWorld competition (X-axis) vs WOMBAT (Y-axis) for 55 overlapping compounds.
