The identification and classification of carcinogens is critical in cancer epidemiology, necessitating updated methodologies to manage the burgeoning biomedical literature. We introduce the Carcinogen Detection via Transformers (CarD-T) framework, combining transformer-based machine learning with probabilistic analysis to efficiently nominate potential carcinogens from scientific texts. Trained on 60% of established carcinogens, CarD-T correctly identifies all remaining known carcinogens and nominates ∼1600 potential new carcinogens. Comparative assessment against GPT-4 reveals CarD-T's comparable precision (0.896 versus 0.903) and superior recall (0.853 versus 0.757), implying an improved ability to nominate potential carcinogens for further evaluation. CarD-T associates each nominated entity with relevant scientific literature, allowing for additional analysis of conflicting implications over time through a Bayesian probabilistic carcinogen denomination analysis. The framework also provides rich insights into carcinogenesis-associated research, revealing significant shifts in research focus on carcinogenic agents over time, from chemical carcinogens to broader categories including biological agents, environmental factors and lifestyle choices. We establish the CarD-T framework as a locally deployable, computationally inexpensive, and robust tool for identifying and nominating potential carcinogens from vast biomedical literature. This framework enhances the agility of public health responses to carcinogen identification, setting a new benchmark for automated, scalable toxicological investigations.
Keywords: automated literature review; biomedical language models; cancer epidemiology; carcinogen literature review; public health; temporal research modeling; toxicology.
CarD-T is a low-cost, transformer-based AI framework that efficiently identifies potential carcinogens with higher recall than a state-of-the-art LLM. It enables probabilistic classification, tracks research trends, and enhances public health responses by rapidly analyzing biomedical literature with minimal computational resources.
© The Author(s) 2025. Published by Oxford University Press.