Development of benchmark datasets for text mining and sentiment analysis to accelerate regulatory literature review

Leihong Wu; Si Chen; Lei Guo; Svitlana Shpyleva; Kelly Harris; Tariq Fahmi; Timothy Flanigan; Weida Tong; Joshua Xu; Zhen Ren

doi:10.1016/j.yrtph.2022.105287

Regul Toxicol Pharmacol. 2023 Jan:137:105287. doi: 10.1016/j.yrtph.2022.105287. Epub 2022 Nov 11.

Authors

Leihong Wu¹, Si Chen², Lei Guo², Svitlana Shpyleva², Kelly Harris³, Tariq Fahmi⁴, Timothy Flanigan⁵, Weida Tong⁶, Joshua Xu⁶, Zhen Ren⁷

Affiliations

¹ Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA. Electronic address: leihong.wu@fda.hhs.gov.
² Division of Biochemical Toxicology, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA.
³ Division of Genetic and Molecular Toxicology, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA.
⁴ Office of Scientific Coordination, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA.
⁵ Division of Neurotoxicology, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA.
⁶ Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA.
⁷ Division of Biochemical Toxicology, National Center for Toxicological Research, U.S. FDA, Jefferson, AR, 72079, USA. Electronic address: zhen.ren@fda.hhs.gov.

PMID: 36372266
DOI: 10.1016/j.yrtph.2022.105287

Abstract

In regulatory science, literature review is an essential step that is usually conducted by manually reading hundreds of articles. Although this process is highly time-consuming and labor-intensive, most of its output is not transformed into a machine-readable format. The resulting scarcity of data has largely constrained the development of artificial intelligence (AI) systems to facilitate literature review in the regulatory process. In the past decade, AI has revolutionized text mining, as many deep learning approaches have been developed to search, annotate, and classify relevant documents. With these advances in AI algorithms, a lack of high-quality data, rather than the algorithms themselves, has become the bottleneck of AI system development. Herein, we constructed two large benchmark datasets, the Chlorine Efficacy dataset (CHE) and the Chlorine Safety dataset (CHS), under a regulatory scenario that sought to assess the antiseptic efficacy and toxicity of chlorine. For each dataset, ∼10,000 scientific articles were collected and manually reviewed, and their relevance to the review task was labeled. To ensure high data quality, each paper was labeled by consensus among multiple experienced reviewers. The overall relevance rate was 27.21% (2,663 of 9,788) for CHE and 7.50% (761 of 10,153) for CHS. Furthermore, the relevant articles were categorized into five subgroups based on the focus of their content. Next, we developed an attention-based classification language model using these two datasets. The proposed classification model yielded an Area Under the Curve (AUC) of 0.857 and 0.908 for the CHE and CHS datasets, respectively. This performance was significantly better than that of the permutation test (p < 10E-9), demonstrating that the labeling processes were valid. To conclude, our datasets can be used as benchmarks to develop AI systems, which can further facilitate the literature review process in regulatory science.
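
The relevance rates and the AUC-versus-permutation comparison reported above can be illustrated with a minimal Python sketch. The abstract does not describe how the permutation test was run, so the label-shuffling procedure below is an assumption, and the function names (`relevance_rate`, `auc`, `permutation_p_value`) are hypothetical. The article counts are taken from the abstract; the labels and scores in the demonstration are synthetic and do not come from CHE or CHS.

```python
import random


def relevance_rate(relevant, total):
    """Percentage of reviewed articles that were labeled relevant."""
    return 100.0 * relevant / total


def auc(labels, scores):
    """Rank-based AUC: chance that a random relevant article outscores
    a random irrelevant one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=200, seed=0):
    """Assumed procedure: shuffle the labels, recompute the AUC, and count
    how often the shuffled AUC reaches the observed one."""
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Counts reported in the abstract.
print(f"CHE relevance rate: {relevance_rate(2663, 9788):.2f}%")
print(f"CHS relevance rate: {relevance_rate(761, 10153):.2f}%")

# Synthetic demonstration data (NOT from the CHE or CHS datasets).
rng = random.Random(42)
labels = [1 if rng.random() < 0.25 else 0 for _ in range(200)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
observed_auc, p_value = permutation_p_value(labels, scores)
print(f"synthetic AUC = {observed_auc:.3f}, permutation p = {p_value:.4f}")
```

With `n_perm` shuffles, the smallest p-value this estimate can report is 1 / (`n_perm` + 1), so a bound as small as the one quoted in the abstract would require far more permutations or a different approximation.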

Keywords: Artificial intelligence; Benchmark dataset; Literature analysis; Regulatory review; Text mining.

Published by Elsevier Inc.

MeSH terms

Artificial Intelligence*
Benchmarking
Chlorine
Data Mining
Machine Learning*
Sentiment Analysis

Substances

Chlorine